
    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbating those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.

    Biomarker lists stability in genomic studies: analysis and improvement by prior biological knowledge integration into the learning process

    The analysis of high-throughput sequencing, microarray and mass spectrometry data has proven extremely helpful for identifying the genes and proteins, called biomarkers, that are useful for answering both diagnostic/prognostic and functional questions. In this context, robustness of the results is critical both to understand the biological mechanisms underlying diseases and to gain sufficient reliability for clinical/pharmaceutical applications. Recently, different studies have shown that the lists of identified biomarkers are poorly reproducible, making the validation of biomarkers as robust predictors of a disease a still open issue. The reasons for these differences are attributable both to data dimensionality (few subjects relative to the number of features) and to the heterogeneity of complex diseases, which are characterized by alterations of multiple regulatory pathways and of the interplay between different genes and the environment. Typically, in an experimental design, the data to analyze come from different subjects and different phenotypes (e.g. normal and pathological). The most widely used methodologies for identifying significant disease-related genes from microarray data compute differential gene expression between phenotypes using univariate statistical tests. Such an approach provides information on the effect of specific genes as independent features, whereas it is now recognized that the interplay among weakly up/down-regulated genes, although not significantly differentially expressed, might be extremely important for characterizing a disease status. Machine learning algorithms are, in principle, able to identify multivariate nonlinear combinations of features and thus have the potential to select a more complete set of experimentally relevant features.
In this context, supervised classification methods are often used to select biomarkers, and different methods, such as discriminant analysis, random forests and support vector machines, have been applied, especially in cancer studies. Although high accuracy is often achieved in classification approaches, the reproducibility of biomarker lists remains an open issue: many possible sets of biological features (i.e. genes or proteins) can be considered equally relevant in terms of prediction, so lists can in principle be unstable even when the best accuracy is achieved. This thesis is a study of several computational aspects of biomarker discovery in genomic studies, from the classification and feature selection strategies to the type and reliability of the biological information used, proposing new approaches able to cope with the problem of the reproducibility of biomarker lists. The study has highlighted that, although reasonable and comparable classification accuracy can be achieved by different methods, further developments are necessary to achieve robust biomarker list stability, because of the high number of features and the high correlation among them. In particular, this thesis proposes two different approaches for improving biomarker list stability by using prior information on the biological interplay and functional correlation among the analyzed features. Both approaches improved biomarker selection. The first, which uses prior information to divide the application of the method into different subproblems, improves the interpretability of results and offers an alternative way to assess list reproducibility. The second, which integrates prior information into the kernel function of the learning algorithm, improves list stability.
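The second approach can be illustrated with a minimal sketch, assuming a simple weighted linear kernel in which features that prior biological knowledge marks as functionally related receive larger weights; the weighting scheme and values below are hypothetical illustrations, not the kernel actually developed in the thesis.

```python
# Sketch of a prior-weighted linear kernel: features flagged as biologically
# relevant by prior knowledge (hypothetical weights, not the thesis's actual
# kernel) contribute more to the similarity between two expression profiles.

def weighted_linear_kernel(x, z, prior_weights):
    """k(x, z) = sum_i w_i * x_i * z_i, with w_i >= 0 from prior knowledge."""
    return sum(w * a * b for w, a, b in zip(prior_weights, x, z))

# Two toy expression profiles over four genes.
x = [1.0, 0.5, -0.2, 2.0]
z = [0.8, 0.1, 0.3, 1.5]

# Prior knowledge (hypothetical): genes 0 and 3 lie in a disease pathway.
weights = [2.0, 1.0, 1.0, 2.0]

print(weighted_linear_kernel(x, z, weights))  # emphasizes genes 0 and 3
```

A kernel of this form stays positive semidefinite for non-negative weights, so it can be plugged directly into any kernel-based learner.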
Finally, the interpretability of results is strongly affected by the quality of the available biological information. The analysis of heterogeneities performed on the Gene Ontology database has revealed the importance of new methods able to verify the reliability of the biological properties assigned to a specific feature, distinguishing missing or less specific information from possible inconsistencies among the annotations. These aspects will be explored in increasing depth in the future, as new sequencing technologies will monitor a growing number of features and the number of functional annotations in genomic databases will grow considerably in the coming years.

    Exploring the Concept of the Digital Educator During COVID-19

    In many machine learning classification problems, datasets are usually of high dimensionality and therefore require efficient and effective methods for identifying the relative importance of their attributes and eliminating the redundant and irrelevant ones. Because of the huge search space of possible solutions, attribute-subset-evaluation feature selection methods are not well suited to these scenarios, so feature ranking methods are used instead. Most of the feature ranking methods described in the literature are univariate and do not detect interactions between factors. In this paper, we propose two new multivariate feature ranking methods based on pairwise correlation and pairwise consistency, which we applied to cancer gene expression and genotype-tissue expression classification tasks using public datasets. We show statistically that the proposed methods outperform the state-of-the-art feature ranking methods Clustering Variation, Chi Squared, Correlation, Information Gain, ReliefF and Significance, as well as feature selection methods for attribute subset evaluation based on correlation and consistency with a multi-objective evolutionary search strategy, and the embedded feature selection methods C4.5 and LASSO. The proposed methods have been implemented on the WEKA platform for public use, making all the results reported in this paper repeatable and replicable.
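A greedy relevance-minus-redundancy scheme gives the flavour of a multivariate, correlation-based ranking; the scoring rule below is a simplified, hypothetical variant for illustration, not the paper's exact method (which is implemented in WEKA):

```python
# Sketch of a multivariate feature ranking driven by pairwise correlation:
# greedily rank features by relevance to the class minus average redundancy
# with the features already ranked (a simplified, hypothetical scoring rule).

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb) if va and vb else 0.0

def rank_features(X_cols, y):
    """X_cols: list of feature columns; y: numeric class labels."""
    relevance = [abs(pearson(col, y)) for col in X_cols]
    ranked, remaining = [], list(range(len(X_cols)))
    while remaining:
        def score(i):
            if not ranked:
                return relevance[i]
            redundancy = sum(abs(pearson(X_cols[i], X_cols[j]))
                             for j in ranked) / len(ranked)
            return relevance[i] - redundancy
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked

# Toy data: feature 0 tracks the class, feature 1 duplicates it, feature 2 is noise.
y  = [0, 0, 1, 1]
f0 = [0.1, 0.2, 0.9, 1.0]
f1 = [0.1, 0.2, 0.9, 1.0]   # redundant copy of f0
f2 = [0.5, 0.1, 0.4, 0.2]   # uncorrelated with the class
print(rank_features([f0, f1, f2], y))  # → [0, 1, 2]
```

Unlike a univariate filter, the redundancy term penalizes the duplicate feature once its twin has been ranked.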

    A framework for feature selection in high-dimensional domains

    The introduction of DNA microarray technology has had an enormous impact on cancer research, allowing researchers to analyze the expression of thousands of genes in concert and to relate gene expression patterns to clinical phenotypes. At the same time, machine learning methods have become one of the dominant approaches for identifying cancer gene signatures, which could increase the accuracy of cancer diagnosis and prognosis. The central challenge is to identify the group of features (i.e. the biomarker) that take part in the same biological process or are regulated by the same mechanism, while minimizing the biomarker size, since small gene expression signatures are known to be the most accurate for phenotype discrimination. To account for these competing concerns, previous studies have proposed different methods for selecting a single subset of features that can be used as an accurate biomarker, capable of differentiating cancer from normal tissues, predicting outcome, detecting recurrence, and monitoring response to cancer treatment. The aim of this thesis is to propose a novel approach that pursues the concept of finding many potential predictive biomarkers. It is motivated by the biological assumption that, given the large number of possible relationships between genes, genes can plausibly be combined in many ways to produce signatures with similar predictive power. An intriguing advantage of our approach is that it increases the statistical power to capture more reliable and consistent biomarkers, whereas a single predictor may not necessarily provide important clues as to the biological differences of interest. Specifically, this thesis presents a framework for feature selection based on a genetic algorithm, a well-known approach recently proposed for feature selection.
To mitigate the high computational cost this algorithm usually requires, the framework structures the feature selection process into a multi-step approach that combines different categories of data mining methods. Starting from a ranking process performed in the first step, the following steps detail a wrapper approach in which a genetic algorithm is coupled with a classifier to explore different feature subspaces in search of optimal biomarkers. The thesis presents the framework in detail along with its validation on popular datasets that the research community usually considers as benchmarks. The competitive classification power of the framework has been carefully evaluated and empirically confirms the benefits of its adoption. Moreover, the experimental results obtained by the proposed framework are comparable to those of analogous literature proposals. Finally, the thesis contributes additional experiments confirming the framework's applicability to categorizing the subject matter of documents.
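The wrapper step can be sketched as follows, with a toy fitness function standing in for the coupled classifier; the "true" signature and the size penalty are hypothetical illustrations, not the framework's actual evaluation:

```python
import random

# Minimal sketch of a genetic-algorithm wrapper for feature selection:
# individuals are feature bitmasks, and the fitness function plays the role
# of the coupled classifier. Here fitness rewards a hypothetical "true"
# signature {1, 3} and penalizes subset size (toy stand-ins only).

def fitness(mask, true_signature=frozenset({1, 3}), size_penalty=0.05):
    selected = {i for i, bit in enumerate(mask) if bit}
    return len(selected & true_signature) - size_penalty * len(selected)

def genetic_feature_selection(n_features, pop_size=20, generations=40, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]                 # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_features)           # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n_features)                # point mutation
            child[i] ^= 1
            children.append(child)
        pop = survivors + children
    best = max(pop, key=fitness)
    return {i for i, bit in enumerate(best) if bit}

print(genetic_feature_selection(n_features=8))
```

In the real framework the fitness evaluation would train and score a classifier on each candidate subset, which is exactly the cost the preliminary ranking step is designed to reduce.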

    Towards Prediction of Radiation Pneumonitis Arising from Lung Cancer Patients Using Machine Learning Approaches

    Radiation pneumonitis (RP) is a potentially fatal side effect arising in lung cancer patients who receive radiotherapy as part of their treatment. For modeling RP outcome data, several predictive models based on traditional statistical methods and machine learning techniques have been reported; however, no guidance on the variation in their performance has been provided to date. In this study, we explore several machine learning algorithms for classifying RP data. The performance of these classification algorithms is investigated in conjunction with several feature selection strategies, and the impact of the feature selection strategy on performance is further evaluated. The extracted features include patients' demographic, clinical and pathological variables, treatment techniques, and dose-volume metrics. In parallel, we have been developing an in-house Matlab-based open source software tool, called DREES, customized for modeling and exploring dose response in radiation oncology. This software has been upgraded with a popular classification algorithm, the support vector machine (SVM), which provided improved performance in our exploratory analysis and has strong potential to strengthen the ability of radiotherapy modelers to analyze radiotherapy outcome data.

    Knowledge management overview of feature selection problem in high-dimensional financial data: Cooperative co-evolution and Map Reduce perspectives

    The term big data characterizes the massive amounts of data generated by advanced technologies in different domains, using the 4Vs (volume, velocity, variety, and veracity) to indicate the amount of data that can only be processed via computationally intensive analysis, the speed of its creation, the different types of data, and their accuracy. High-dimensional financial data, such as time-series and space-time data, contain a large number of features (variables) while having a small number of samples; they are used to measure various real-time business situations for financial organizations. Such datasets are normally noisy, complex correlations may exist between their features, and many domains, including finance, lack the analytical tools to mine the data for knowledge discovery because of the high dimensionality. Feature selection is an optimization problem: find a minimal subset of relevant features that maximizes classification accuracy and reduces the computation. Traditional statistics-based feature selection approaches are not adequate to deal with the curse of dimensionality associated with big data. Cooperative co-evolution, a meta-heuristic algorithm and a divide-and-conquer approach, decomposes high-dimensional problems into smaller sub-problems. Further, MapReduce, a programming model, offers a ready-to-use distributed, scalable, and fault-tolerant infrastructure for parallelizing the developed algorithm. This article presents a knowledge management overview of evolutionary feature selection approaches, state-of-the-art cooperative co-evolution and MapReduce-based feature selection techniques, and future research directions.
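The decomposition idea can be sketched as a map step that scores each feature partition independently and a reduce step that merges the per-partition winners; the relevance scores below are hypothetical stand-ins for a real filter criterion, not any particular system's scoring:

```python
# Sketch of divide-and-conquer feature selection in MapReduce style:
# partition the features into sub-problems, score each partition
# independently (map), and merge the top features per partition (reduce).

def map_step(partition, relevance, keep=2):
    """Score one feature partition independently; keep its best features."""
    return sorted(partition, key=lambda f: relevance[f], reverse=True)[:keep]

def reduce_step(partial_results):
    """Merge the per-partition winners into one candidate subset."""
    return sorted(f for part in partial_results for f in part)

def distributed_feature_selection(n_features, relevance, n_partitions=3, keep=2):
    partitions = [list(range(i, n_features, n_partitions))
                  for i in range(n_partitions)]
    partial = [map_step(p, relevance, keep) for p in partitions]  # parallelizable
    return reduce_step(partial)

# Toy relevance scores for 9 features (hypothetical filter criterion values).
relevance = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6, 0.15]
print(distributed_feature_selection(9, relevance))  # → [1, 2, 3, 5, 6, 7]
```

The map calls are independent, which is what lets a MapReduce runtime distribute them across workers; a cooperative co-evolutionary variant would additionally evolve the partitions' subsets jointly against a shared fitness.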

    Multi-class gene expression biomarker panel identification for the diagnosis of paediatric febrile illness

    Febrile illness in children can result from infections by diverse viral or bacterial pathogens as well as from inflammatory conditions or cancer. The limitations of the existing diagnostic pipeline, which relies on clinical symptoms and signs, pathogen detection, empirical treatment and diagnoses of exclusion, contribute to missed or delayed diagnoses and unnecessary antibiotic use. The potential of host gene expression biomarkers measured in blood has been demonstrated for simplified binary diagnostic questions; however, the clinical reality is that multiple potential aetiologies must be considered and prioritised on the basis of likelihood and the risk of severe disease. To identify a biomarker panel that better reflects this clinical reality, we applied a multi-class supervised learning approach to whole blood transcriptomic datasets from children with infectious and inflammatory disease. Three datasets were used for the analyses presented here: a single microarray dataset, a meta-analysis of 12 publicly available microarray datasets, and a newly generated RNA-sequencing dataset. These were used for preliminary investigations of the approach, discovery of a multi-class biomarker panel of febrile illness, and validation of the biomarker panel, respectively. In the merged microarray discovery dataset, a two-stage approach to feature selection and classification based on LASSO and Ridge penalised regression was applied to distinguish 18 disease classes. Cost-sensitivity was incorporated in the approach because aetiologies of febrile illness vary considerably in the risk of severe disease. The resulting 161-transcript biomarker panel could reliably distinguish bacterial, viral, inflammatory, tuberculosis and malarial disease as well as pathogen-specific aetiologies. The panel was then validated in a newly generated RNA-Seq dataset and compared to previously published binary biomarker panels.
The analyses presented here demonstrate that a single test for the diagnosis of acute febrile illness in children is possible using host RNA biomarkers. A test that could distinguish multiple aetiologies soon after presentation could be used to reduce unnecessary antibiotic use, improve the targeting of antibiotics to bacterial species and reduce delays in the diagnosis of inflammatory diseases.
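The cost-sensitivity idea can be sketched as weighting each misclassification by the risk of the missed diagnosis; the classes and cost values below are hypothetical illustrations, not the study's actual cost matrix:

```python
# Sketch of cost-sensitive evaluation: misclassifications are not equally
# harmful, so errors are weighted by the risk of the missed diagnosis.
# cost[true][predicted]: missing a bacterial infection is costed most
# heavily (all classes and values here are hypothetical).
COST = {
    "bacterial":    {"bacterial": 0, "viral": 10, "inflammatory": 8},
    "viral":        {"bacterial": 2, "viral": 0,  "inflammatory": 2},
    "inflammatory": {"bacterial": 3, "viral": 5,  "inflammatory": 0},
}

def total_cost(y_true, y_pred):
    return sum(COST[t][p] for t, p in zip(y_true, y_pred))

y_true = ["bacterial", "viral", "inflammatory", "bacterial"]
pred_a = ["bacterial", "viral", "inflammatory", "viral"]  # misses a bacterial case
pred_b = ["bacterial", "viral", "viral", "bacterial"]     # errs on a low-risk class

print(total_cost(y_true, pred_a), total_cost(y_true, pred_b))  # → 10 5
```

Both classifiers make one error, but the cost-weighted score prefers the one whose error carries less clinical risk, which is the behaviour a cost-sensitive loss pushes the trained model toward.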

    Decoding Clinical Biomarker Space of COVID-19: Exploring Matrix Factorization-based Feature Selection Methods

    One of the most critical challenges in managing complex diseases like COVID-19 is to establish an intelligent triage system that can optimize clinical decision-making during a global pandemic. The clinical presentation and patients' characteristics are usually used to identify those patients who need more critical care. However, the clinical evidence shows an unmet need for more accurate and optimal clinical biomarkers to triage patients under a condition like the COVID-19 crisis. Here we present a machine learning approach to find a group of clinical indicators from the blood tests of a set of COVID-19 patients that are predictive of poor prognosis and morbidity. Our approach consists of two interconnected schemes: feature selection and prognosis classification. The former is based on different Matrix Factorization (MF) methods, and the latter is performed using the Random Forest algorithm. Our model reveals that Arterial Blood Gas (ABG) O2 Saturation and C-Reactive Protein (CRP) are the most important clinical biomarkers determining poor prognosis in these patients. Our approach paves the way for building quantitative and optimized clinical management systems for COVID-19 and similar diseases.
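A minimal sketch of MF-based feature scoring, assuming a tiny NMF with multiplicative updates (an illustrative choice; the paper explores several MF methods): factorize a nonnegative patient-by-biomarker matrix X ≈ W·H and score each biomarker by its largest loading in H.

```python
import random

# Illustrative NMF with multiplicative updates on pure-Python lists,
# then biomarker scoring from the loadings in H (a sketch, not the
# paper's exact pipeline).

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(X, k=1, iters=200, eps=1e-9):
    rng = random.Random(0)
    n, m = len(X), len(X[0])
    W = [[rng.random() for _ in range(k)] for _ in range(n)]
    H = [[rng.random() for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        WH = matmul(W, H)                                   # H-update: H *= (W^T X) / (W^T W H)
        num, den = matmul(transpose(W), X), matmul(transpose(W), WH)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        WH = matmul(W, H)                                   # W-update: W *= (X H^T) / (W H H^T)
        num, den = matmul(X, transpose(H)), matmul(WH, transpose(H))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return W, H

def biomarker_scores(X, k=1):
    _, H = nmf(X, k)
    return [max(H[c][j] for c in range(k)) for j in range(len(X[0]))]

# Toy blood-test matrix (4 patients x 3 markers); marker 2 dominates.
X = [[1.0, 0.2, 3.0],
     [0.9, 0.1, 2.8],
     [1.1, 0.3, 3.2],
     [1.0, 0.2, 3.1]]
scores = biomarker_scores(X)
print(scores.index(max(scores)))  # → 2, the marker with the strongest loading
```

The selected components would then feed a downstream classifier (Random Forest in the paper) to predict prognosis.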

    OmicsON: integration of omics data with molecular networks and statistical procedures

    The huge amount of atomized biological data collected in various databases, and the need to describe their relations by theoretical methods, has driven the development of data integration methods. This paper presents omics data analysis by integration of biological knowledge with the mathematical procedures implemented in the OmicsON R library. OmicsON is a tool for integrating two sets of data: transcriptomics and metabolomics. In the library's workflow, functional grouping and statistical analysis are applied. Subgroups among the transcriptomic and metabolomic sets are created based on the biological knowledge stored in the Reactome and STRING databases. This makes it possible to analyze such sets of data by multivariate statistical procedures like Canonical Correlation Analysis (CCA) or Partial Least Squares (PLS). The integration of metabolomic and transcriptomic data based on the methodology contained in OmicsON makes it easy to obtain information on the connections between data from two different sets. This information can significantly help in assessing the relationship between gene expression and metabolite concentrations, which in turn facilitates the biological interpretation of the analyzed process.
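The statistical step can be sketched as the first PLS component linking a transcriptomics block X to a metabolomics block Y, computed with the classical NIPALS iteration; the data and dimensions are toy values, and OmicsON itself delegates such computations to R routines.

```python
# Sketch of the first Partial Least Squares component via NIPALS,
# linking a transcriptomics block X to a metabolomics block Y.

def pls_first_component(X, Y, iters=100):
    """Return weight vectors w (for X) and c (for Y) of the first PLS component."""
    def mat_vec_t(M, v):  # M^T v
        return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]
    def mat_vec(M, v):    # M v
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    def unit(v):
        n = sum(x * x for x in v) ** 0.5
        return [x / n for x in v]

    u = [row[0] for row in Y]          # initialize the Y-score
    for _ in range(iters):
        w = unit(mat_vec_t(X, u))      # X weights
        t = mat_vec(X, w)              # X scores
        c = unit(mat_vec_t(Y, t))      # Y weights
        u = mat_vec(Y, c)              # Y scores
    return w, c

# Toy centered data: 4 samples, 3 transcripts (X) and 2 metabolites (Y);
# transcript 0 co-varies with metabolite 0.
X = [[ 1.0,  0.1, -0.2], [-1.0, -0.1,  0.1], [ 2.0,  0.2, -0.1], [-2.0, -0.2,  0.2]]
Y = [[ 0.9, -0.1], [-1.1,  0.0], [ 2.1,  0.1], [-1.9,  0.0]]
w, c = pls_first_component(X, Y)
print(max(range(len(w)), key=lambda i: abs(w[i])),
      max(range(len(c)), key=lambda i: abs(c[i])))  # → 0 0
```

The dominant weights recover the co-varying transcript-metabolite pair, which is the kind of cross-omics connection the OmicsON workflow is designed to surface.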