2,474 research outputs found
Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment
COST Action CA18131
Cierva Grant IJC2019-042188-I (LM-Z)
Estonian Research Council grant PUT 1371
The number of microbiome-related studies has notably increased the availability of data on human microbiome composition and function. These studies provide the essential material to deeply explore host-microbiome associations and their relation to the development and progression of various complex diseases. Improved data-analytical tools are needed to exploit all the information in these biological datasets, taking into account the peculiarities of microbiome data, i.e., their compositional, heterogeneous and sparse nature. The possibility of predicting host phenotypes through taxonomy-informed feature selection, establishing associations between the microbiome and disease states, is beneficial for personalized medicine. In this regard, machine learning (ML) provides new insights into the development of models that can be used for classification and prediction in microbiology, to infer host phenotypes and predict diseases, and to stratify patients by characterizing state-specific microbial signatures. Here we review the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on the application of ML in microbiome studies related to association and clinical use for diagnostics, prognostics, and therapeutics. Although the data presented here relate mostly to the bacterial community, many algorithms can be applied in general, regardless of the feature type. This literature and software review covering this broad topic is aligned with the scoping review methodology.
The manual identification of data sources has been complemented with: (1) automated publication search through the digital libraries of the three major publishers using a natural language processing (NLP) toolkit, and (2) automated identification of relevant software repositories on GitHub and ranking of the related research papers relying on a learning-to-rank approach.
Robust Identification of Target Genes and Outliers in Triple-negative Breast Cancer Data
Correct classification of breast cancer sub-types is of high importance as it
directly affects the therapeutic options. We focus on triple-negative breast
cancer (TNBC) which has the worst prognosis among breast cancer types. Using
cutting edge methods from the field of robust statistics, we analyze Breast
Invasive Carcinoma (BRCA) transcriptomic data publicly available from The
Cancer Genome Atlas (TCGA) data portal. Our analysis identifies statistical
outliers that may correspond to misdiagnosed patients. Furthermore, it is
illustrated that classical statistical methods may fail in the presence of
these outliers, prompting the need for robust statistics. Using robust sparse
logistic regression we obtain 36 relevant genes, of which ca. 60% have been
previously reported as biologically relevant to TNBC, reinforcing the validity
of the method. The remaining 14 genes identified are new potential biomarkers
for TNBC. Out of these, JAM3, SFT2D2 and PAPSS1 were previously associated with
breast tumors or other types of cancer. The relevance of these genes is
confirmed by the new DetectDeviatingCells (DDC) outlier detection technique. A
comparison of gene networks on the selected genes showed significant
differences between TNBC and non-TNBC data. The individual role of FOXA1 in
TNBC and non-TNBC, and the strong FOXA1-AGR2 connection in TNBC stand out. Not
only will our results contribute to the breast cancer/TNBC understanding and
ultimately its management, they also show that robust regression and outlier
detection constitute key strategies to cope with high-dimensional clinical data
such as omics data.
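The pipeline described above can be illustrated with a minimal, non-robust stand-in: an L1-penalized logistic regression that selects a small gene subset, with samples showing unusually large deviance residuals flagged as candidate outliers. This is a sketch on synthetic data, not the paper's robust estimator; all parameters and the 95% threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 100, 200                      # samples x genes (p >> n, as in omics data)
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:5] = 2.0   # only the first 5 genes carry signal
prob = 1 / (1 + np.exp(-(X @ beta)))
y = rng.binomial(1, prob)

# L1 penalty drives most coefficients exactly to zero (sparse gene selection)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(clf.coef_[0])

# Flag candidate outliers: samples with unusually large deviance residuals
p_hat = np.clip(clf.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
dev = -2 * (y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
outliers = np.flatnonzero(dev > np.quantile(dev, 0.95))
```

A robust estimator would instead downweight or trim such samples during fitting, rather than flagging them after a classical fit.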
Classification and biomarker selection in lower-grade glioma using robust sparse logistic regression applied to RNA-seq data
Funding Information: This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with references CEECINST/00102/2018, UIDB/00297/2020 (NOVA MATH, Center for Mathematics and Applications), UIDB/04516/2020 (NOVA LINCS), and the research project “MONET – Multi-omic networks in gliomas” (PTDC/CCI-BIO/4180/2020). The results presented are based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. Publisher Copyright: © Brazilian Journal of Biometrics.
Tumor heterogeneity poses a barrier to effective cancer diagnosis and treatment and, consequently, to the development of personalized medicine. In the particular case of gliomas, brain tumors that are highly heterogeneous at the histological, cellular and molecular levels and that exhibit poor prognosis, the mechanisms behind tumor heterogeneity and progression remain poorly understood. Recent advances in biomedical high-throughput technologies have allowed the generation of large amounts of molecular information from patients which, combined with statistical and machine learning techniques, can be used to define glioma subtypes and targeted therapies, an invaluable contribution to disease understanding and effective management. In this work, sparse and robust sparse logistic regression models with the elastic net penalty were applied to glioma RNA-seq data from The Cancer Genome Atlas (TCGA) to identify relevant transcriptomic features separating lower-grade glioma (LGG) subtypes and to identify putative outlying observations. In general, all classification models yielded good accuracies, while selecting different sets of genes. Among the genes selected by the models, TXNDC12, TOMM20, PKIA, CARD8 and TAF12 have been reported as playing a relevant role in glioma development and progression.
This highlights the suitability of the present approach to disclose relevant genes and fosters the biological validation of non-reported genes.
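The sparse classification step can be sketched with an elastic-net-penalized logistic regression, here via scikit-learn on synthetic data standing in for RNA-seq profiles: the L1 component yields a sparse gene subset, while the L2 component stabilizes groups of correlated features. This is a hedged illustration of the penalty, not the authors' code, and the robust variant is not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n, p = 120, 300                       # RNA-seq-like: far more genes than samples
X = rng.normal(size=(n, p))
# subtype label driven by two "genes" (columns 0 and 1) plus noise
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(int)

X = StandardScaler().fit_transform(X)
# Elastic net mixes L1 (sparsity) and L2 (grouping of correlated genes)
clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=0.5, max_iter=5000).fit(X, y)
genes = np.flatnonzero(clf.coef_[0])  # indices of selected "genes"
```

In practice `C` and `l1_ratio` would be tuned by cross-validation rather than fixed as here.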
Twiner: correlation-based regularization for identifying common cancer gene signatures
© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Background: Breast and prostate cancers are typical examples of hormone-dependent cancers, showing remarkable similarities at the level of hormone-related signaling pathways and exhibiting a high tropism to bone. While the identification of genes playing a specific role in each cancer type brings invaluable insights for gene therapy research, by targeting disease-specific cell functions not accounted for so far, identifying a gene signature common to breast and prostate cancers could unravel new targets to tackle shared hormone-dependent disease features, like bone relapse. This would potentially allow the development of new targeted therapies directed at genes regulating both cancer types, with a consequent positive impact on cancer management and health economics.
Results: We address the challenge of extracting gene signatures from transcriptomic data of prostate adenocarcinoma (PRAD) and breast invasive carcinoma (BRCA) samples, particularly estrogen positive (ER+), and androgen positive (AR+) triple-negative breast cancer (TNBC), using sparse logistic regression. The introduction of gene network information based on the distances between the BRCA and PRAD correlation matrices is investigated through the proposed twin networks recovery (twiner) penalty, a strategy ensuring that gene features similarly correlated in the two diseases are less penalized during the feature selection procedure.
Conclusions: Our analysis led to the identification of genes that show a similar correlation pattern in BRCA and PRAD transcriptomic data, that are selected as key players in the classification of breast and prostate samples into ER+ BRCA/AR+ TNBC/PRAD tumor and normal tissues, and that are also associated with survival time distributions. The results obtained are supported by the literature and are expected to unveil the similarities between the diseases, disclose common disease biomarkers, and help in the definition of new strategies for more effective therapies.
This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with references UID/EEA/50008/2019 (Instituto de Telecomunicações), UID/CEC/50021/2019 (INESC-ID), UID/EMS/50022/2019 (IDMEC, LAETA), PREDICT (PTDC/CCI-CIF/29877/2017), and PERSEIDS (PTDC/EMS-SIS/0642/2014).
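The twiner idea of penalizing each gene according to how differently it correlates with the other genes in the two diseases can be sketched as follows: compute each dataset's gene-gene correlation matrix, measure the angle between a gene's correlation vector in the two diseases, and use the normalized angle as that gene's penalty weight (near zero means less penalized). The exact normalization in the paper may differ; this is an illustrative numpy sketch on synthetic data.

```python
import numpy as np

def twiner_weights(X_a, X_b):
    """Per-gene penalty weights from the dissimilarity of each gene's
    correlation pattern in two datasets (smaller weight = less penalized).
    Sketch of the twiner idea; the published normalization may differ."""
    R_a = np.corrcoef(X_a, rowvar=False)   # gene-gene correlations, disease A
    R_b = np.corrcoef(X_b, rowvar=False)   # gene-gene correlations, disease B
    # angle between each gene's two correlation vectors, scaled to [0, 1]
    num = np.sum(R_a * R_b, axis=0)
    den = np.linalg.norm(R_a, axis=0) * np.linalg.norm(R_b, axis=0)
    w = np.arccos(np.clip(num / den, -1.0, 1.0)) / np.pi
    return w / w.max() if w.max() > 0 else w

rng = np.random.default_rng(2)
n, p = 80, 20
shared = rng.normal(size=(n, 2))
X_a = rng.normal(size=(n, p)); X_b = rng.normal(size=(n, p))
# genes 0 and 1 are strongly correlated with each other in BOTH diseases
X_a[:, 0] = shared[:, 0]; X_a[:, 1] = shared[:, 0] + 0.1 * rng.normal(size=n)
X_b[:, 0] = shared[:, 1]; X_b[:, 1] = shared[:, 1] + 0.1 * rng.normal(size=n)
w = twiner_weights(X_a, X_b)
```

The weights would then scale a sparsity-inducing penalty in a logistic regression over the pooled samples, so that genes 0 and 1 survive selection more easily.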
Tcox: Correlation-based regularization applied to colorectal cancer survival data
This work was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with references PD/BD/139146/2018, IF/00409/2014, UIDB/50021/2020 (INESC-ID), UIDB/50022/2020 (IDMEC), UIDB/04516/2020 (NOVA LINCS), and UIDB/00297/2020 (CMA), and projects PREDICT (PTDC/CCI-CIF/29877/2017) and MATISSE (DSAIPA/DS/0026/2019).
Colorectal cancer (CRC) is one of the leading causes of mortality and morbidity in the world. Being a heterogeneous disease, its therapy and prognosis represent a significant challenge to medical care. Molecular information improves the accuracy with which patients are classified and treated, since similar pathologies may show different clinical outcomes and responses to treatment. However, the high dimensionality of gene expression data makes the selection of novel genes a problematic task. We propose TCox, a novel penalization function for Cox models which promotes the selection of genes that have distinct correlation patterns in normal vs. tumor tissues. We compare TCox to other regularized survival models: Elastic Net, HubCox, and OrphanCox. Gene expression and clinical data of CRC and normal (TCGA) patients are used for model evaluation. Each model is tested 100 times. Within a specific run, eighteen of the features selected by TCox are also selected by the other survival regression models tested, suggesting they are crucial players in the survival of colorectal cancer patients. Moreover, the TCox model exclusively selects genes able to categorize patients into significant risk groups. Our work demonstrates the ability of the proposed weighted regularizer TCox to disclose novel molecular drivers in CRC survival by accounting for correlation-based network information from both tumor and normal tissue.
The results presented support the relevance of network information for biomarker identification in high-dimensional gene expression data and foster new directions for the development of network-based feature selection methods in precision oncology.
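A sketch in the spirit of TCox's weighting, with the exact published formula not reproduced: compare each gene's correlation vector in tumor vs. normal tissue and turn large differences into small penalty factors, so that "rewired" genes are favoured during selection. The resulting factors would then be supplied to a penalized Cox solver that accepts per-feature penalty factors (glmnet-style); here only the weight construction is shown, on synthetic data.

```python
import numpy as np

def tcox_penalty_factors(X_tumor, X_normal):
    """Per-gene penalty factors favouring genes whose correlation pattern
    differs between tumor and normal tissue (smaller factor = favoured).
    Illustrative sketch of the idea, not the published implementation."""
    R_t = np.corrcoef(X_tumor, rowvar=False)
    R_n = np.corrcoef(X_normal, rowvar=False)
    # distance between each gene's correlation vector in the two tissues
    dist = np.linalg.norm(R_t - R_n, axis=0)
    dist = dist / dist.max()
    return 1.0 - dist + 1e-3          # distinct pattern -> near-zero penalty

rng = np.random.default_rng(3)
n, p = 60, 15
X_normal_demo = rng.normal(size=(n, p))
X_tumor_demo = X_normal_demo + 0.05 * rng.normal(size=(n, p))  # mostly unchanged
X_tumor_demo[:, 0] = rng.normal(size=n)   # but gene 0 is rewired in tumor
w = tcox_penalty_factors(X_tumor_demo, X_normal_demo)
```

Gene 0, whose correlations change most between tissues, receives the smallest penalty factor.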
RObust Sparse ensemble for outlIEr detection and gene selection in cancer omics data
We thank Peter Segaert for providing his adapted code of the enetLTS method. The results presented here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.
Funded by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy - EXC 2075 390740016 (AJ and NR).
Publisher Copyright:
© The Author(s) 2022.
The extraction of novel information from omics data is a challenging task, in particular since the number of features (e.g. genes) often far exceeds the number of samples. In such a setting, conventional parameter estimation leads to ill-posed optimization problems, and regularization may be required. In addition, outliers can largely impact classification accuracy. Here we introduce ROSIE, an ensemble classification approach which combines three sparse and robust classification methods for outlier detection and feature selection, and further performs a bootstrap-based validity check. Outliers are determined by the rank product test applied to the outlier rankings of all three methods, and important features are selected as those commonly selected by all methods. We apply ROSIE to RNA-Seq data from The Cancer Genome Atlas (TCGA) to classify observations into Triple-Negative Breast Cancer (TNBC) and non-TNBC tissue samples. The pre-processed dataset consists of (Formula presented.) genes and more than (Formula presented.) samples. We demonstrate that ROSIE selects important features and outliers in a robust way. Identified outliers are concordant with the distribution of the genes commonly selected by the three methods, and results are in line with other independent studies. Furthermore, we discuss the association of some of the selected genes with the TNBC subtype in other investigations. In summary, ROSIE constitutes a robust and sparse procedure to identify outliers and important genes through binary classification. Our approach is readily applicable to other datasets, fulfilling the overall goal of simultaneously identifying outliers and candidate disease biomarkers to be targeted in therapy research and personalized medicine frameworks.
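The rank product aggregation at the heart of the ensemble's outlier calls is straightforward to sketch: each method ranks samples by outlyingness, and the per-sample geometric mean of ranks gives a consensus score (low means consistently outlying), while consensus features are those selected by all methods. The rankings and index sets below are invented for illustration.

```python
import numpy as np

def rank_product(rankings):
    """Combine several rankings (1 = most outlying) into one score per sample:
    the geometric mean of its ranks across methods. Low = consistent outlier."""
    R = np.asarray(rankings, dtype=float)          # methods x samples
    return np.exp(np.mean(np.log(R), axis=0))

# Outlier ranks of 6 samples under three hypothetical robust classifiers
ranks_m1 = [1, 4, 2, 6, 3, 5]
ranks_m2 = [2, 5, 1, 6, 4, 3]
ranks_m3 = [1, 6, 3, 5, 2, 4]
rp = rank_product([ranks_m1, ranks_m2, ranks_m3])
top_outlier = int(np.argmin(rp))       # sample 0 is ranked outlying by all methods

# Consensus feature selection: genes chosen by all three methods
sel = set([0, 3, 7]) & set([0, 3, 9]) & set([0, 3, 5])
```

In the actual procedure the rank product's significance is assessed by a statistical test rather than by taking the minimum score.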
Satellite-based feature extraction and multivariate time-series prediction of biotoxin contamination in shellfish
Shellfish production constitutes an important sector for the economy of many
Portuguese coastal regions, yet the challenge of shellfish biotoxin
contamination poses both public health concerns and significant economic risks.
Thus, predicting shellfish contamination levels holds great potential for
enhancing production management and safeguarding public health. In our study,
we utilize a dataset with years of Sentinel-3 satellite imagery for marine
surveillance, along with shellfish biotoxin contamination data from various
production areas along Portugal's western coastline, collected by Portuguese
official control. Our goal is to evaluate the integration of satellite data in
forecasting models for predicting toxin concentrations in shellfish given
forecasting horizons up to four weeks, which implies extracting a small set of
useful features and assessing their impact on the predictive models. We framed
this challenge as a time-series forecasting problem, leveraging historical
contamination levels and satellite images for designated areas. While
contamination measurements occurred weekly, satellite images were accessible
multiple times per week. Unsupervised feature extraction was performed using
autoencoders able to handle non-valid pixels caused by factors like cloud
cover, land, or anomalies. Finally, several Artificial Neural Networks models
were applied to compare univariate (contamination only) and multivariate
(contamination and satellite data) time-series forecasting. Our findings show
that incorporating these features enhances predictions, especially beyond one
week in lagoon production areas (RIAV) and for the 1-week and 2-week horizons
in the L5B area (oceanic). The methodology shows the feasibility of integrating
information from a high-dimensional data source like remote sensing without
compromising the model's predictive ability.
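One common way to make an autoencoder tolerate non-valid pixels, presumably in the spirit of the approach described, is to mask them out of the reconstruction loss so that cloud, land, and anomalous pixels contribute no error signal. A minimal numpy sketch of such a masked loss (the actual architecture and framework are not specified here):

```python
import numpy as np

def masked_mse(x, x_hat, valid):
    """Reconstruction error restricted to valid pixels: invalid pixels
    (cloud cover, land, sensor anomalies) contribute nothing to the loss."""
    valid = np.asarray(valid, dtype=bool)
    diff = (x - x_hat)[valid]
    return float(np.mean(diff ** 2)) if diff.size else 0.0

# Toy 4x4 "satellite chip": two pixels hidden by cloud
x     = np.arange(16, dtype=float).reshape(4, 4)
x_hat = x + 1.0                       # reconstruction off by 1 everywhere
valid = np.ones((4, 4), dtype=bool)
valid[0, 0] = valid[3, 3] = False     # cloud-covered pixels are ignored
loss = masked_mse(x, x_hat, valid)    # mean over the 14 valid pixels only
```

In a deep-learning framework the same idea appears as an element-wise mask multiplied into the loss tensor before reduction.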
The role of network science in glioblastoma
Network science has long been recognized as a well-established discipline across many biological domains. In the particular case of cancer genomics, network discovery is challenged by the multitude of available high-dimensional heterogeneous views of data. Glioblastoma (GBM) is an example of such a complex and heterogeneous disease that can be tackled by network science. Identifying the architecture of molecular GBM networks is essential to understanding the information flow and better informing drug development and pre-clinical studies. Here, we review network-based strategies that have been used in the study of GBM, along with the available software implementations for reproducibility and further testing on newly arriving datasets. Promising results have been obtained from both bulk and single-cell GBM data, placing network discovery at the forefront of developing molecularly informed personalized medicine.
This work was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with references CEECINST/00102/2018, CEECIND/00072/2018, PD/BDE/143154/2019, UIDB/04516/2020, UIDB/00297/2020, UIDB/50021/2020, UIDB/50022/2020, UIDB/50026/2020, UIDP/50026/2020, NORTE-01-0145-FEDER-000013 and NORTE-01-0145-FEDER-000023, and projects PTDC/CCI-BIO/4180/2020 and DSAIPA/DS/0026/2019. This project has received funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreement No. 951970 (OLISSIPO project).
Forecasting biotoxin contamination in mussels across production areas of the Portuguese coast with Artificial Neural Networks
Harmful algal blooms (HABs) and the consequent contamination of shellfish are complex processes depending on several biotic and abiotic variables, turning the prediction of shellfish contamination into a challenging task. Not only is the information of interest dispersed among multiple sources, but the complex temporal relationships between the time-series variables also require advanced machine learning methods to model. In this study, multiple time-series variables measured in Portuguese shellfish production areas were used to forecast shellfish contamination by diarrhetic shellfish poisoning (DSP) toxins one to four weeks in advance. These time series included DSP concentration in mussels (Mytilus galloprovincialis), toxic phytoplankton cell counts, and meteorological and remotely sensed oceanographic variables. Several data pre-processing and feature engineering methods were tested, as well as multiple autoregressive and artificial neural network (ANN) models. The best results regarding the mean absolute error of prediction were obtained for a bivariate long short-term memory (LSTM) neural network based on biotoxin and toxic phytoplankton measurements, with higher accuracy for short-term forecasting horizons. When evaluating all ANN models' ability to predict the contamination state (below or above the regulatory limit) and changes to this state, multilayer perceptrons (MLP) and convolutional neural networks (CNN) yielded improved predictive performance on a case-by-case basis. These results show the possibility of extracting information predictive of DSP contamination in mussels from multi-source time-series data, placing ANNs as good candidate models to assist the production sector in anticipating harvesting interdictions and mitigating economic losses.
© 2022 The Author(s). Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
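Framing contamination forecasting as supervised learning typically starts by windowing the weekly series: each input holds the previous weeks of toxin and phytoplankton values, and the target is the toxin level some weeks ahead. A numpy sketch, with the lookback, horizon, and toy values all assumptions for illustration:

```python
import numpy as np

def make_windows(series, lookback, horizon):
    """Turn a multivariate weekly series (T x features) into supervised pairs:
    X[i] = the `lookback` past weeks, y[i] = toxin level `horizon` weeks ahead
    (toxin assumed to be column 0)."""
    T = len(series)
    X, y = [], []
    for t in range(lookback, T - horizon + 1):
        X.append(series[t - lookback:t])       # past window of all variables
        y.append(series[t + horizon - 1, 0])   # future toxin value only
    return np.array(X), np.array(y)

# Toy bivariate series: weekly DSP toxin level and phytoplankton cell counts
weeks = 20
toxin = np.linspace(10, 200, weeks)
cells = np.linspace(1e3, 5e4, weeks)
data = np.column_stack([toxin, cells])
X, y = make_windows(data, lookback=4, horizon=2)   # predict 2 weeks ahead
```

The resulting `(samples, lookback, features)` array is the shape LSTM layers expect, while MLPs would flatten each window.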
A review of recent machine learning advances for forecasting harmful Algal Blooms and shellfish contamination
Harmful algal blooms (HABs) are among the most severe ecological marine problems worldwide. Under favorable climate and oceanographic conditions, toxin-producing microalgae species may proliferate, reach increasingly high cell concentrations in seawater, accumulate in shellfish, and threaten the health of seafood consumers. There is an urgent need for effective tools to help shellfish farmers cope with and anticipate HAB events and shellfish contamination, which frequently lead to significant negative economic impacts. Statistical and machine learning forecasting tools have been developed in an attempt to better inform the shellfish industry, limit damages, improve mitigation measures and reduce production losses. This study presents a synoptic review of trends in machine learning methods for predicting HABs and shellfish biotoxin contamination, with a particular focus on autoregressive models, support vector machines, random forests, probabilistic graphical models, and artificial neural networks (ANN). Most efforts have attempted to forecast HABs using models of increasing complexity over the years, coupled with increased multi-source data availability, with ANN architectures at the forefront of modeling these events. The purpose of this review is to help define machine-learning-based strategies that support the shellfish industry in managing harvesting and production, and decision-making by governmental agencies with environmental responsibilities.
Funding: CEECINST/00102/2018, UIDB/04516/2020, UIDB/00297/2020, UIDB/50021/2020, UID/Multi/04326/2020