
    Machine and deep learning meet genome-scale metabolic modeling

    Omic data analysis is steadily growing as a driver of basic and applied molecular biology research. Core to the interpretation of complex and heterogeneous biological phenotypes are computational approaches in the fields of statistics and machine learning. In parallel, constraint-based metabolic modeling has established itself as the main tool to investigate large-scale relationships between genotype, phenotype, and environment. The development and application of these methodological frameworks have occurred independently for the most part, whereas the potential of their integration for biological, biomedical, and biotechnological research is less known. Here, we describe how machine learning and constraint-based modeling can be combined, reviewing recent works at the intersection of both domains and discussing the mathematical and practical aspects involved. We overlap systematic classifications from both frameworks, making them accessible to nonexperts. Finally, we delineate potential future scenarios, propose new joint theoretical frameworks, and suggest concrete points of investigation for this joint subfield. A multiview approach merging experimental and knowledge-driven omic data through machine learning methods can incorporate key mechanistic information in an otherwise biologically agnostic learning process.
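    As a hedged illustration of how knowledge-driven model outputs and experimental omic data can be merged in a single learning step (a generic sketch, not the approach of any particular study reviewed here), the example below concatenates measured expression features with flux features that would, in practice, be predicted by a genome-scale metabolic model; all matrices, sample sizes and the scikit-learn classifier are assumptions made for the example.

# Sketch: augmenting measured omic features with knowledge-derived features
# (e.g. fluxes predicted by a genome-scale metabolic model) before learning.
# All data here are synthetic placeholders; real fluxes would be computed
# with a constraint-based modeling toolbox such as COBRApy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples = 60
expression = rng.normal(size=(n_samples, 200))   # measured transcriptomic view
fluxes = rng.normal(size=(n_samples, 50))        # model-predicted flux view (placeholder)
labels = rng.integers(0, 2, size=n_samples)      # phenotype to predict

# Multi-view "early" integration: concatenate the experimental and
# knowledge-driven views into a single feature space before fitting.
X = np.hstack([expression, fluxes])

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, labels, cv=5)
print("CV accuracy with expression + flux features:", scores.mean().round(3))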

    Developing a workflow for the multi-omics analysis of Daphnia

    In the era of multi-omics, making reasonable statistical inferences through data integration is challenged by data heterogeneity, dimensionality constraints, and data harmonization. The biological system is presumed to function as a network in which the physical relationships between genes (nodes) are represented by links (edges) connecting genes that interact. This thesis aims to develop a new and efficient workflow, built from readily available software tools, for analysing multi-omics data from a non-model organism, aimed at researchers focused on the biological questions. The proposed approach was applied to transcriptome and metabolome data of Daphnia magna exposed to various dose rates of gamma radiation. The first part of this workflow compares and contrasts the transcriptional regulation of short- and long-term gamma radiation exposure. Groups of genes that share similar expression across samples under the same conditions are known as modules and are likely to be functionally relevant. Modules were identified using WGCNA, but biologically meaningful modules (significant modules) were selected through a novel approach that associates genes with significantly altered expression levels as a result of radiation (i.e. differentially expressed genes) with these candidate modules. Dynamic transcriptional regulation was modelled using transcription factor (TF) DNA-binding patterns to associate TFs with the expression responses captured by the modules. The biological functions of significant modules and their TF regulators were verified with functional annotations and mapped onto the proposed Adverse Outcome Pathway (AOP) of D. magna, which describes the key events that contribute to fecundity reduction. The findings demonstrate that short-term radiation impacts are entirely different from long-term impacts and cannot be used for long-term prediction. The second part investigates the coordination of gene expression and metabolites with differential abundances induced by different gamma dose rates, and the underlying mechanisms contributing to the varying extent of the reduction in fecundity. Significant modules belonging to the same design model of dose rates were combined and annotated with new functionality. The abundance of metabolites was also modelled with the same design model. Integrated pathway enrichment analysis was performed to discover and create pathway diagrams for visualising the multi-omics output. Finally, the performance of this workflow in explaining the reduction of fecundity of D. magna, which has not been described in previous studies, was evaluated. Combining the information from the metabolome and transcriptome data, new insights suggest that alteration of the cell cycle is the underlying mechanism contributing to the varying reduction of fecundity under different dose rates of radiation.
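    The association of differentially expressed genes with candidate co-expression modules can be framed as an over-representation test. The sketch below shows one such test using a hypergeometric distribution; the thesis itself builds on WGCNA in R, and the gene sets here are placeholders invented for the example.

# Sketch: test whether a co-expression module is enriched for differentially
# expressed genes (DEGs) using a hypergeometric (over-representation) test.
# Gene identifiers below are placeholders.
from scipy.stats import hypergeom

background = {f"gene{i}" for i in range(5000)}   # all genes in the analysis
degs = {f"gene{i}" for i in range(0, 400)}       # differentially expressed genes
module = {f"gene{i}" for i in range(300, 500)}   # one WGCNA-style module

overlap = len(degs & module)
M, n, N = len(background), len(degs), len(module)

# P(X >= overlap) when a module of this size is drawn at random
p_value = hypergeom.sf(overlap - 1, M, n, N)
print(f"DEGs in module: {overlap}/{N}, enrichment p-value = {p_value:.2e}")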

    Multi-view learning and data integration for omics data

    In recent years, the advancement of high-throughput technologies, combined with the constant decrease of data-storage costs, has led to the production of large amounts of data from different experiments that characterise the same entities of interest. This information may relate to specific aspects of a phenotypic entity (e.g. gene expression), or can include the comprehensive and parallel measurement of multiple molecular events (e.g., DNA modifications, RNA transcription and protein translation) in the same samples. Exploiting such complex and rich data is needed in the frame of systems biology for building global models able to explain complex phenotypes. For example, the use of genome-wide data in cancer research, for the identification of groups of patients with similar molecular characteristics, has become a standard approach for applications in therapy response, prognosis prediction, and drug development. Moreover, the integration of gene expression data regarding cell treatment by drugs with information regarding the chemical structure of the drugs has allowed scientists to perform more accurate drug repositioning tasks. Unfortunately, there is a large gap between the amount of information produced and the knowledge into which it is translated, and there is a pressing need for computational methods able to integrate and analyse data to fill this gap. Current research in this area follows two different integrative approaches: one uses the complementary information of different measurements for the study of complex phenotypes on the same samples (multi-view learning); the other tends to infer knowledge about the phenotype of interest by integrating and comparing the experiments relating to it with those of different phenotypes already known, through comparative methods (meta-analysis). Meta-analysis can be thought of as an integrative study of previous results, usually performed by aggregating the summary statistics from different studies. Due to its nature, meta-analysis usually involves homogeneous data. On the other hand, multi-view learning is a more flexible approach that considers the fusion of different data sources to obtain more stable and reliable estimates. Based on the type of data and the stage of integration, new methodologies have been developed spanning a landscape of techniques comprising graph theory, machine learning and statistics. Depending on the nature of the data and on the statistical problem to address, the integration of heterogeneous data can be performed at different levels: early, intermediate and late. Early integration consists in concatenating data from different views in a single feature space. Intermediate integration consists in transforming all the data sources into a common feature space before combining them. In late integration methodologies, each view is analysed separately and the results are then combined. The purpose of this thesis is twofold: the first objective is the definition of a data integration methodology for patient sub-typing (MVDA) and the second is the development of a tool for phenotypic characterisation of nanomaterials (INSIdE nano). In this PhD thesis, I present the methodologies and the results of my research. MVDA is a multi-view methodology that aims to discover new statistically relevant patient sub-classes. Identifying patient subtypes of a specific disease is a challenging task, especially for early diagnosis.
    This is a crucial point for treatment, because not all the patients affected by the same disease will have the same prognosis or need the same drug treatment. This problem is usually addressed by using transcriptomic data to identify groups of patients that share the same gene expression patterns. The main idea underlying this research work is to combine multiple omics data types for the same patients to obtain a better characterisation of their disease profile. The proposed methodology is a late-integration approach based on clustering. It works by evaluating the patient clusters in each single view and then combining the clustering results of all the views by factorising the membership matrices in a late-integration manner. The effectiveness and the performance of our method were evaluated on six multi-view cancer datasets related to breast cancer, glioblastoma, prostate and ovarian cancer. The omics data used for the experiments are gene and miRNA expression, RNA-Seq and miRNA-Seq, protein expression and copy number variation. In all the cases, patient sub-classes with statistical significance were found, identifying novel sub-groups not previously emphasised in the literature. The experiments were also conducted using prior information, as a new view in the integration process, to obtain higher accuracy in patient classification. The method outperformed single-view clustering on all the datasets; moreover, it performed better when compared with other multi-view clustering algorithms and, unlike other existing methods, it can quantify the contribution of single views to the results. The method has also been shown to be stable when perturbation is applied to the datasets by removing one patient at a time and evaluating the normalised mutual information between all the resulting clusterings. These observations suggest that the integration of prior information with genomic features in sub-typing analysis is an effective strategy for identifying disease subgroups. INSIdE nano (Integrated Network of Systems bIology Effects of nanomaterials) is a novel tool for the systematic contextualisation of the effects of engineered nanomaterials (ENMs) in the biomedical context. In recent years, omics technologies have been increasingly used to thoroughly characterise the ENM molecular mode of action. It is possible to contextualise the molecular effects of different types of perturbations by comparing their patterns of alterations. While this approach has been successfully used for drug repositioning, a comprehensive contextualisation of the ENM mode of action is still missing to date. The idea behind the tool is to use analytical strategies to contextualise or position ENMs with respect to relevant phenotypes that have been studied in the literature (such as diseases, drug treatments, and other chemical exposures) by comparing their patterns of molecular alteration. This could greatly increase the knowledge of ENM molecular effects and in turn contribute to the definition of relevant pathways of toxicity, as well as help in predicting the potential involvement of ENMs in pathogenetic events or in novel therapeutic strategies. The main hypothesis is that suggestive patterns of similarity between sets of phenotypes could be an indication of a biological association to be further tested in toxicological or therapeutic frames.
    Based on the expression signature associated with each phenotype, the strength of similarity between each pair of perturbations was evaluated and used to build a large network of phenotypes. To ensure the usability of INSIdE nano, a robust and scalable computational infrastructure was developed to scan this large phenotypic network, and an effective web-based graphical user interface was built. In particular, INSIdE nano was scanned to search for clique sub-networks: quadruplet structures of heterogeneous nodes (a disease, a drug, a chemical and a nanomaterial) completely interconnected by strong patterns of similarity (or anti-similarity). The predictions were evaluated for a set of known associations between diseases and drugs, based on drug indications in clinical practice, and between diseases and chemicals, based on literature-based causal exposure evidence, with a focus on the possible involvement of nanomaterials in the most robust cliques. The evaluation of INSIdE nano confirmed that it highlights known disease-drug and disease-chemical connections. Moreover, disease similarities agree with information based on their clinical features, and similarities between drugs and chemicals mirror their resemblance based on chemical structure. Altogether, the results suggest that INSIdE nano can also be successfully used to contextualise the molecular effects of ENMs and infer their connections to other, better-studied phenotypes, speeding up their safety assessment as well as opening new perspectives concerning their usefulness in biomedicine.
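    The late-integration scheme described for MVDA (cluster each omic view separately, then combine the per-view membership matrices by factorisation) can be sketched roughly as follows. This is a hedged illustration, not the thesis' actual implementation: the views, cluster counts and the use of scikit-learn's KMeans and NMF are assumptions made for the example.

# Sketch of a late-integration clustering scheme: cluster each omic view
# separately, then factorise the stacked cluster-membership matrices to get
# multi-view meta-clusters. Data are synthetic placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
n_patients = 100
views = [rng.normal(size=(n_patients, 500)),   # e.g. gene expression
         rng.normal(size=(n_patients, 300))]   # e.g. miRNA expression

memberships = []
for view in views:
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(view)
    onehot = np.zeros((n_patients, 4))
    onehot[np.arange(n_patients), labels] = 1.0   # binary membership matrix
    memberships.append(onehot)

# Late integration: factorise the concatenated membership matrices.
stacked = np.hstack(memberships)                  # patients x (views * clusters)
W = NMF(n_components=3, init="nndsvda", random_state=0).fit_transform(stacked)
meta_clusters = W.argmax(axis=1)                  # final multi-view sub-classes
print(np.bincount(meta_clusters))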
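    The quadruplet-clique scan described for INSIdE nano can likewise be illustrated on a toy phenotype network in which each node carries a type (disease, drug, chemical or nanomaterial), edges encode strong similarity, and fully connected quadruplets containing one node of each type are enumerated. The node names and edges below are invented for the example.

# Sketch: find 4-node cliques that contain one disease, one drug, one chemical
# and one nanomaterial in a similarity network. Nodes and edges are toy data.
from itertools import combinations
import networkx as nx

G = nx.Graph()
nodes = {"diseaseA": "disease", "drugB": "drug",
         "chemC": "chemical", "nanoD": "nanomaterial", "drugE": "drug"}
G.add_nodes_from((name, {"kind": kind}) for name, kind in nodes.items())
G.add_edges_from([("diseaseA", "drugB"), ("diseaseA", "chemC"),
                  ("diseaseA", "nanoD"), ("drugB", "chemC"),
                  ("drugB", "nanoD"), ("chemC", "nanoD"),
                  ("diseaseA", "drugE")])

wanted = {"disease", "drug", "chemical", "nanomaterial"}
for quad in combinations(G.nodes, 4):
    kinds = {G.nodes[n]["kind"] for n in quad}
    # keep only heterogeneous quadruplets that are fully interconnected
    if kinds == wanted and all(G.has_edge(u, v) for u, v in combinations(quad, 2)):
        print("clique:", quad)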

    Modern Views of Machine Learning for Precision Psychiatry

    In light of the NIMH's Research Domain Criteria (RDoC), the advent of functional neuroimaging, novel technologies and methods provide new opportunities to develop precise and personalized prognosis and diagnosis of mental disorders. Machine learning (ML) and artificial intelligence (AI) technologies are playing an increasingly critical role in the new era of precision psychiatry. Combining ML/AI with neuromodulation technologies can potentially provide explainable solutions in clinical practice and effective therapeutic treatment. Advanced wearable and mobile technologies also call for a new role of ML/AI in digital phenotyping for mobile mental health. Here, we provide a comprehensive review of ML methodologies and applications that combine neuroimaging, neuromodulation, and advanced mobile technologies in psychiatry practice. Additionally, we review the role of ML in molecular phenotyping and cross-species biomarker identification in precision psychiatry. We further discuss explainable AI (XAI) and causality testing in a closed human-in-the-loop manner, and highlight the ML potential in multimedia information extraction and multimodal data fusion. Finally, we discuss conceptual and practical challenges in precision psychiatry and highlight ML opportunities in future research.

    Biomarker lists stability in genomic studies: analysis and improvement by prior biological knowledge integration into the learning process

    The analysis of high-throughput sequencing, microarray and mass spectrometry data has proven extremely helpful for the identification of those genes and proteins, called biomarkers, useful for answering both diagnostic/prognostic and functional questions. In this context, robustness of the results is critical both to understand the biological mechanisms underlying diseases and to gain sufficient reliability for clinical/pharmaceutical applications. Recently, different studies have shown that the lists of identified biomarkers are poorly reproducible, making the validation of biomarkers as robust predictors of a disease a still open issue. The reasons for these differences are attributable both to the data dimensions (few subjects with respect to the number of features) and to the heterogeneity of complex diseases, characterised by alterations of multiple regulatory pathways and of the interplay between different genes and the environment. Typically, in an experimental design, the data to analyse come from different subjects and different phenotypes (e.g. normal and pathological). The most widely used methodologies for the identification of significant genes related to a disease from microarray data are based on computing differential gene expression between phenotypes by univariate statistical tests. Such an approach provides information on the effect of specific genes as independent features, whereas it is now recognised that the interplay among weakly up/down-regulated genes, although not significantly differentially expressed, might be extremely important to characterise a disease status. Machine learning algorithms are, in principle, able to identify multivariate nonlinear combinations of features and thus have the potential to select a more complete set of experimentally relevant features. In this context, supervised classification methods are often used to select biomarkers, and different methods, such as discriminant analysis, random forests and support vector machines, among others, have been used, especially in cancer studies. Although high accuracy is often achieved in classification approaches, the reproducibility of biomarker lists remains an open issue, since many possible sets of biological features (i.e. genes or proteins) can be considered equally relevant in terms of prediction; thus, it is in principle possible to have a lack of stability even while achieving the best accuracy. This thesis represents a study of several computational aspects related to biomarker discovery in genomic studies: from the classification and feature selection strategies to the type and the reliability of the biological information used, proposing new approaches able to cope with the problem of the reproducibility of biomarker lists. The study has highlighted that, although reasonable and comparable classification accuracy can be achieved by different methods, further developments are necessary to achieve robust biomarker list stability, because of the high number of features and the high correlation among them. In particular, this thesis proposes two different approaches to improve biomarker list stability by using prior information related to biological interplay and functional correlation among the analysed features. Both approaches were able to improve biomarker selection. The first approach, using prior information to divide the application of the method into different subproblems, improves result interpretability and offers an alternative way to assess list reproducibility.
    The second, integrating prior information into the kernel function of the learning algorithm, improves list stability. Finally, the interpretability of results is strongly affected by the quality of the biological information available, and the analysis of the heterogeneities performed on the Gene Ontology database has revealed the importance of providing new methods able to verify the reliability of the biological properties assigned to a specific feature, discriminating missing or less specific information from possible inconsistencies among the annotations. These aspects will be investigated more and more deeply in the future, as new sequencing technologies will monitor an increasing number of features and the number of functional annotations from genomic databases will grow considerably in the coming years.
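    One way to quantify the reproducibility of biomarker lists discussed above (a generic sketch, not necessarily the exact protocol used in the thesis) is to repeat feature selection on resampled data and measure the average pairwise overlap of the resulting top-k lists, for example with the Jaccard index. All data, the univariate selection criterion and the list size below are assumptions for the example.

# Sketch: quantify biomarker-list stability by repeating feature selection on
# bootstrap resamples and computing the mean pairwise Jaccard overlap of the
# top-k feature lists. Data are synthetic placeholders.
import numpy as np
from itertools import combinations
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(2)
n, p, k = 80, 1000, 50
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)
X[y == 1, :20] += 0.8          # a few weakly informative features

lists = []
for _ in range(20):
    idx = rng.integers(0, n, size=n)             # bootstrap resample
    scores, _ = f_classif(X[idx], y[idx])
    lists.append(set(np.argsort(scores)[-k:]))   # top-k features by F-score

jaccard = [len(a & b) / len(a | b) for a, b in combinations(lists, 2)]
print("mean pairwise Jaccard of top-k lists:", round(np.mean(jaccard), 3))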

    Pacific Symposium on Biocomputing 2023

    The Pacific Symposium on Biocomputing (PSB) 2023 is an international, multidisciplinary conference for the presentation and discussion of current research in the theory and application of computational methods in problems of biological significance. Presentations are rigorously peer reviewed and are published in an archival proceedings volume. PSB 2023 will be held on January 3-7, 2023 in Kohala Coast, Hawaii. Tutorials and workshops will be offered prior to the start of the conference. PSB 2023 will bring together top researchers from the US, the Asian Pacific nations, and around the world to exchange research results and address open issues in all aspects of computational biology. It is a forum for the presentation of work in databases, algorithms, interfaces, visualization, modeling, and other computational methods, as applied to biological problems, with emphasis on applications in data-rich areas of molecular biology. The PSB has been designed to be responsive to the need for critical mass in sub-disciplines within biocomputing. For that reason, it is the only meeting whose sessions are defined dynamically each year in response to specific proposals. PSB sessions are organized by leaders of research in biocomputing's 'hot topics.' In this way, the meeting provides an early forum for serious examination of emerging methods and approaches in this rapidly changing field.

    Leveraging big data resources and data integration in biology: applying computational systems analyses and machine learning to gain insights into the biology of cancers

    Recently, many "molecular profiling" projects have yielded vast amounts of genetic, epigenetic, transcription, protein expression, metabolic and drug response data for cancerous tumours, healthy tissues, and cell lines. We aim to facilitate a multi-scale understanding of these high-dimensional biological data and the complexity of the relationships between the different data types taken from human tumours. Further, we intend to identify molecular disease subtypes of various cancers, uncover the subtype-specific drug targets and identify sets of therapeutic molecules that could potentially be used to inhibit these targets. We collected data from over 20 publicly available resources. We then leverage integrative computational systems analyses, network analyses and machine learning to gain insights into the pathophysiology of pancreatic cancer and 32 other human cancer types. Here, we uncover aberrations in multiple cell signalling and metabolic pathways that implicate regulatory kinases and the Warburg effect as the likely drivers of the distinct molecular signatures of three established pancreatic cancer subtypes. Then, we apply an integrative clustering method to four different types of molecular data to reveal that pancreatic tumours can be segregated into two distinct subtypes. We define sets of proteins, mRNAs, miRNAs and DNA methylation patterns that could serve as biomarkers to accurately differentiate between the two pancreatic cancer subtypes. Then we confirm the biological relevance of the identified biomarkers by showing that these can be used together with pattern-recognition algorithms to infer the drug sensitivity of pancreatic cancer cell lines accurately. Further, we evaluate the alterations of metabolic pathway genes across 32 human cancers. We find that while alterations of metabolic genes are pervasive across all human cancers, the extent of these gene alterations varies between them. Based on these gene alterations, we define two distinct cancer supertypes that tend to be associated with different clinical outcomes and show that these supertypes are likely to respond differently to anticancer drugs. Overall, we show that the time has arrived when we can leverage available data resources to potentially elicit more precise and personalised cancer therapies that would yield better clinical outcomes at a much lower cost than is currently being achieved.
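    As a hedged sketch of the pattern-recognition step mentioned above (inferring the drug sensitivity of cell lines from molecular biomarkers), the example below fits a random-forest regressor to synthetic expression features and drug-response values; the data types, models and evaluation used in the thesis are richer, and all names and values here are placeholders.

# Sketch: infer drug sensitivity (e.g. log IC50) of cell lines from the
# expression of candidate biomarker genes. All values are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n_cell_lines, n_biomarkers = 120, 40
expression = rng.normal(size=(n_cell_lines, n_biomarkers))
# placeholder response: sensitivity depends on a few biomarkers plus noise
log_ic50 = expression[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=n_cell_lines)

model = RandomForestRegressor(n_estimators=300, random_state=0)
r2 = cross_val_score(model, expression, log_ic50, cv=5, scoring="r2")
print("cross-validated R^2:", r2.mean().round(3))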

    Algorithms for pre-microRNA classification and a GPU program for whole genome comparison

    MicroRNAs (miRNAs) are non-coding RNAs of approximately 22 nucleotides that are derived from precursor molecules. These precursor molecules, or pre-miRNAs, often fold into stem-loop hairpin structures. However, a large number of sequences with pre-miRNA-like hairpins can be found in genomes, and it is a challenge to distinguish real pre-miRNAs from other hairpin sequences with similar stem-loops (referred to as pseudo pre-miRNAs). The first part of this dissertation presents a new method, called MirID, for identifying and classifying microRNA precursors. MirID comprises three steps. Initially, a combinatorial feature mining algorithm is developed to identify suitable feature sets. Then, the feature sets are used to train support vector machines to obtain classification models, from which a classifier ensemble is constructed. Finally, an AdaBoost algorithm is adopted to further enhance the accuracy of the classifier ensemble. Experimental results on a variety of species demonstrate the good performance of the proposed approach and its superiority over existing methods. In the second part of this dissertation, a GPU (Graphics Processing Unit) program is developed for whole genome comparison. The goal of this research is to identify the commonalities and differences of two genomes from closely related organisms via multiple sequence alignments, using a seed-and-extend technique to choose reliable subsets of exact or near-exact matches, called anchors. A rigorous method, Smith-Waterman search, is applied for anchor seeking, but it takes days to months to map millions of bases of mammalian genome sequences. With GPU programming, which runs hundreds of short functions called threads in parallel, up to a 100X speedup is achieved over comparable CPU executions.
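    A rough sketch of the classifier-ensemble idea behind MirID (support vector machines trained on different feature subsets, combined by voting, with a boosting stage on top) is given below. The hairpin features and labels are synthetic placeholders, and the feature-subset choice, ensemble size and scikit-learn parameter names are assumptions for the example, not the dissertation's actual pipeline.

# Sketch: SVM ensemble over feature subsets with majority voting, plus an
# AdaBoost-ed SVM for comparison, for real vs. pseudo pre-miRNA hairpins.
# Features (e.g. GC content, stem length, folding energy) are synthetic.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 30))
y = (X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=400) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# ensemble of SVMs, each trained on a random feature subset, majority vote
subsets = [rng.choice(30, size=10, replace=False) for _ in range(7)]
votes = np.zeros((len(y_te), 2))
for cols in subsets:
    svm = SVC(kernel="rbf").fit(X_tr[:, cols], y_tr)
    pred = svm.predict(X_te[:, cols])
    votes[np.arange(len(y_te)), pred] += 1
print("ensemble accuracy:", (votes.argmax(axis=1) == y_te).mean().round(3))

# boosting stage (SAMME works with a hard-decision SVC; passing the base
# estimator positionally keeps this compatible across scikit-learn versions)
ada = AdaBoostClassifier(SVC(kernel="rbf"), n_estimators=15, algorithm="SAMME")
print("AdaBoost-SVM accuracy:", ada.fit(X_tr, y_tr).score(X_te, y_te).round(3))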

    Bioinformatics Techniques for Studying Drug Resistance in HIV and Staphylococcus aureus

    The worldwide HIV/AIDS pandemic has been partly controlled and treated by antivirals targeting HIV protease, integrase and reverse transcriptase; however, drug resistance has become a serious problem. HIV-1 drug resistance to protease inhibitors evolves by mutations in the PR gene. The resistance mutations can alter protease catalytic activity, inhibitor binding, and stability. Different machine learning algorithms (restricted Boltzmann machines, clustering, etc.) have been shown to be effective tools for classification of genomic and resistance data. Application of restricted Boltzmann machines produced highly accurate and robust classification of HIV protease resistance. They can also be used to compare resistance profiles of different protease inhibitors. HIV drug resistance has also been studied by enzyme kinetics and X-ray crystallography. A triple-mutant HIV-1 protease with resistance mutations V32I, I47V and V82I has been used as a model for the active site of HIV-2 protease. The effects of four investigational antiviral inhibitors were measured for the triple mutant. The tested compounds showed significantly worse inhibition of the triple mutant, with Ki values of 17-40 nM compared to 2-10 pM for the wild-type protease. The crystal structure of the triple mutant in complex with GRL01111 was solved and showed few changes in protease interactions with the inhibitor. These new inhibitors are not expected to be effective against HIV-2 protease or HIV-1 protease with the changes V32I, I47V and V82I. Methicillin-resistant Staphylococcus aureus (MRSA) is an opportunistic pathogen that causes hospital- and community-acquired infections. Antibiotic resistance occurs because of a newly acquired low-affinity penicillin-binding protein (PBP2a). Transcriptome analysis was performed to determine how MuM (mutated PBP2 gene) responds to spermine and how Mu50 (wild type) responds to spermine and spermine–β-lactam synergy. Exogenous spermine and oxacillin were found to alter significant gene expression patterns in major biochemical pathways (iron, sigB regulon) in MRSA with the mutant PBP2 protein.
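    The restricted Boltzmann machine classification mentioned above can be sketched with scikit-learn's BernoulliRBM used as a feature extractor in front of a logistic-regression classifier, applied to one-hot encoded protease sequences. The encoding scheme, data and hyperparameters below are placeholders, not the study's actual setup.

# Sketch: restricted Boltzmann machine features + logistic regression to
# classify protease sequences as resistant vs. susceptible. Sequences are
# one-hot encoded; all data here are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(5)
n_seqs, n_positions, n_amino_acids = 300, 99, 20
# one-hot encoding of 99-residue protease sequences (placeholder random data)
residues = rng.integers(0, n_amino_acids, size=(n_seqs, n_positions))
X = np.zeros((n_seqs, n_positions * n_amino_acids))
for i, seq in enumerate(residues):
    X[i, np.arange(n_positions) * n_amino_acids + seq] = 1.0
y = rng.integers(0, 2, size=n_seqs)  # resistant (1) vs. susceptible (0)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = Pipeline([
    ("rbm", BernoulliRBM(n_components=64, learning_rate=0.05,
                         n_iter=20, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_tr, y_tr)
print("held-out accuracy:", round(model.score(X_te, y_te), 3))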