14 research outputs found

    Large-Scale Automatic Feature Selection for Biomarker Discovery in High-Dimensional OMICs Data

    The identification of biomarker signatures in omics molecular profiling is usually performed to predict outcomes in a precision medicine context, such as patient disease susceptibility, diagnosis, prognosis, and treatment response. To identify these signatures, we have developed a biomarker discovery tool called BioDiscML. From a collection of samples and their associated characteristics, i.e., the biomarkers (e.g., gene expression, protein levels, clinico-pathological data), BioDiscML exploits various feature selection procedures to produce signatures associated with machine learning models that efficiently predict a specified outcome. To this end, BioDiscML uses a large variety of machine learning algorithms to select the best combination of biomarkers for predicting categorical or continuous outcomes from highly unbalanced datasets. The software has been implemented to automate all machine learning steps, including data pre-processing, feature selection, model selection, and performance evaluation. BioDiscML is delivered as a stand-alone program and is available for download at https://github.com/mickaelleclercq/BioDiscML

    Identification of a Transcriptomic Prognostic Signature by Machine Learning Using a Combination of Small Cohorts of Prostate Cancer

    Determining which treatment to provide to men with prostate cancer (PCa) is a major challenge for clinicians. Currently, clinical risk stratification for PCa is based on clinico-pathological variables such as Gleason grade, stage, and prostate-specific antigen (PSA) levels. Transcriptomic data, however, have the potential to enable more precise approaches to predicting the evolution of the disease. High-quality RNA sequencing (RNA-seq) datasets accompanied by clinical data with long follow-up, which would allow the discovery of biochemical recurrence (BCR) biomarkers, are small and rare. In this study, we propose a machine learning approach that is robust to batch effect and enables the discovery of highly predictive signatures despite using small datasets. Gene expression data were extracted from three RNA-seq datasets totalling 171 PCa patients. Data were re-analyzed using a single pipeline to ensure uniformity. A total of 14 classifiers were tested with various parameters to identify the best model and gene signature to predict BCR. Using a random forest model, we identified a signature composed of only three genes (JUN, HES4, PPDPF) that predicts BCR with better accuracy [74.2%, balanced error rate (BER) = 27%] than the clinico-pathological variables (69.2%, BER = 32%) currently used to predict PCa evolution. This score is in the range of studies that predicted BCR in a single cohort with a higher number of patients. We showed that it is possible to merge and analyze different small and heterogeneous datasets together to obtain a better signature than if they were analyzed individually, thus reducing the need for very large cohorts. This study demonstrates the feasibility of combining several small datasets into a larger one to identify a predictive genomic signature that would benefit PCa patients.

    Investigation of the Genus Flavobacterium as a Reservoir for Fish-Pathogenic Bacterial Species: the Case of Flavobacterium collinsii

    Bacteria of the genus Flavobacterium are recovered from a large variety of environments. Among the described species, Flavobacterium psychrophilum and Flavobacterium columnare cause considerable losses in fish farms. Alongside these well-known fish-pathogenic species, isolates belonging to the same genus recovered from diseased or apparently healthy wild, feral, and farmed fish have been suspected to be pathogenic. Here, we report the identification and genomic characterization of a Flavobacterium collinsii isolate (TRV642) retrieved from rainbow trout spleen. A phylogenetic tree of the genus built by aligning the core genome of 195 Flavobacterium species revealed that F. collinsii stands within a cluster of species associated with diseased fish, the closest one being F. tructae, which was recently confirmed as pathogenic. We evaluated the pathogenicity of F. collinsii TRV642 as well as of Flavobacterium bernardetii F-372T, another recently described species reported as a possible emerging pathogen. Following intramuscular injection challenges in rainbow trout, no clinical signs or mortalities were observed with F. bernardetii. F. collinsii showed very low virulence but was isolated from the internal organs of survivors, indicating that the bacterium is able to survive inside the host and may provoke disease in fish under compromised conditions such as stress and/or wounds. Our results suggest that members of a phylogenetic cluster of fish-associated Flavobacterium species may be opportunistic fish pathogens causing disease under specific circumstances. IMPORTANCE Aquaculture has expanded significantly worldwide in the last decades and accounts for half of human fish consumption. However, infectious fish diseases are a major bottleneck for its sustainable development, and an increasing number of bacterial species from diseased fish raise a great concern.
The current study revealed phylogenetic associations with ecological niches among the Flavobacterium species. We also focused on Flavobacterium collinsii, which belongs to a group of putative pathogenic species. The genome contents revealed a versatile metabolic repertoire suggesting the use of diverse nutrient sources, a characteristic of saprophytic or commensal bacteria. In a rainbow trout experimental challenge, the bacterium survived inside the host, likely escaping clearance by the immune system but without provoking massive mortality, suggesting opportunistic pathogenic behavior. This study highlights the importance of experimentally evaluating the pathogenicity of the numerous bacterial species retrieved from diseased fi

    Prediction of lipomatous soft tissue malignancy on MRI: comparison between machine learning applied to radiomics and deep learning

    Objectives: Malignancy of lipomatous soft-tissue tumours is suspected on magnetic resonance imaging (MRI) and requires a biopsy. The aim of this study is to compare the performance of MRI radiomic machine learning (ML) analysis with that of deep learning (DL) in predicting malignancy in patients with lipomas or atypical lipomatous tumours. Methods: The cohort included 145 patients affected by lipomatous soft-tissue tumours with histology and a fat-suppressed gadolinium contrast-enhanced T1-weighted MRI pulse sequence. Images were collected between 2010 and 2019 across 78 centres with non-uniform protocols: three different magnetic field strengths (1.0, 1.5 and 3.0 T) on 16 MR systems commercialised by four vendors (General Electric, Siemens, Philips, Toshiba). Two approaches were compared: (i) ML from radiomic features with and without batch correction; and (ii) DL from images. Performance was assessed using 10 cross-validation folds on a test set and then on external validation data. Results: The best DL model was obtained using ResNet50, giving an area under the curve (AUC) of 0.87 ± 0.11 (95% CI 0.65−1). For ML/radiomics, AUCs reached 0.83 ± 0.12 (95% CI 0.59−1) and 0.99 ± 0.02 (95% CI 0.95−1) on the test cohort using gradient boosting without and with batch effect correction, respectively. On the external cohort, the AUC of the gradient boosting model was 0.80, and for an optimised decision threshold, sensitivity and specificity were 100% and 32%, respectively. Conclusions: In this context of limited observations, batch-effect-corrected ML/radiomics approaches outperformed DL-based models.

    Transcriptome architecture and regulation at environmental transitions in flavobacteria: the case of an important fish pathogen

    The family Flavobacteriaceae (phylum Bacteroidetes) is a major component of soil, marine and freshwater ecosystems. In this understudied family, Flavobacterium psychrophilum is a freshwater pathogen that infects salmonid fish worldwide, with critical environmental and economic impact. Here, we report an extensive transcriptome analysis that established the genome map of transcription start sites and transcribed regions, predicted alternative sigma factor regulons and regulatory RNAs, and documented gene expression profiles across 32 biological conditions mimicking the pathogen life cycle. The results link genes to environmental conditions and phenotypic traits and provide insights into gene regulation, highlighting similarities with better-known bacteria and original characteristics linked to the phylogenetic position and the ecological niche of the bacterium. In particular, osmolarity appears as a signal for transition between free-living and within-host programs, and expression patterns of secreted proteins shed light on probable virulence factors. Further investigations showed that a newly discovered sRNA widely conserved in the genus, Rfp18, is required for precise expression of proteases. By pointing to proteins and regulatory elements probably involved in host–pathogen interactions, metabolic pathways, and molecular machineries, the results suggest many directions for future research; a website is made available to facilitate their use to fill knowledge gaps on flavobacteria.

    Immune-focused multi-omics analysis of prostate cancer: leukocyte Ig-Like receptors are associated with disease progression

    Prostate cancer (PCa) immunotherapy has shown limited efficacy so far, even in advanced-stage cancers. The success rate of PCa immunotherapy might be improved by approaches better adapted to the immunobiology of the disease. The objective of this study was to perform a multi-omics analysis to identify immune genes associated with PCa progression, in order to better characterize PCa immunobiology and propose new immunotherapeutic targets. mRNA, miRNA, methylation, copy number aberration, and single nucleotide variant datasets from The Cancer Genome Atlas PRAD cohort were analyzed after filtering for genes associated with immunity. Sparse partial least squares-discriminant analyses were performed to identify features associated with biochemical recurrence (BCR) in each type of omics data. Selected features predicted BCR with a balanced error rate (BER) of 0.20 to 0.51 in single-omics and of 0.05 in multi-omics analyses. Among the features associated with BCR were genes from the Leukocyte Ig-Like Receptor (LILR) family, which are immune checkpoints with immunotherapeutic potential. Using Multivariate INTegrative (MINT) analysis, the association of five LILR genes with BCR was quantified in a combination of three RNA-seq datasets and confirmed with Kaplan-Meier analysis both in these and in an independent RNA-seq dataset. Finally, immunohistochemistry showed that a high number of LILRB1-positive cells within the tumors predicted long-term adverse outcomes. Thus, tumors characterized by abnormal expression of LILR genes have an elevated risk of recurring after definitive local therapy. The immunotherapeutic potential of these regulators to stimulate the immune response against PCa should be evaluated in pre-clinical models.