160 research outputs found

    Analysis methods for studying the 3D architecture of the genome

    Learning Networks with Categorical Data using Distance Correlation, and A Novel Graph-Based Multivariate Test

    We study the use of distance correlation for statistical inference on categorical data, especially the induction of probability networks. Szekely et al. first defined distance correlation for continuous variables in [42], and Zhang translated the concept into the categorical setting in [57] by defining dCor(X,Y) for categorical variables X = (x_1, ..., x_I) and Y = (y_1, ..., y_J), where P(X = x_i) = π_i and P(Y = y_j) = π_j, with the formula [Please open the document]. Part I of the dissertation covers the background needed to understand this formula and prepares us to analyze the properties and performance of its applications. Part II then presents the main results of the dissertation, applying distance correlation to learn the structure of probability networks with categorical nodes. We cover in detail how the distance correlation measure may be combined with search methods based on graphical models to induce network structure. This leads to our empirical results, obtained by enhancing the INeS software library [6]. These results involve experiments on six data sets, including the Danish Jersey cattle blood type determination data and the ALARM network; in terms of accuracy metrics such as edges missed from the true network, induction with distance correlation achieves higher accuracy on average than induction with existing measures such as mutual information and chi-squared. We conclude Part II by connecting to earlier joint work with Zhang in [58] on the use of conditional distance covariance for conditional independence and homogeneity tests in large sparse three-way tables. The simulation studies in this work offer another source of intuition for why distance correlation may be able to recover network structure more accurately than traditional measures. In Part III, we end the dissertation by discussing another application of graphical models, in this case to the derivation of a graph-based multivariate test. The test statistic is computationally cheap and is proven to converge to a chi-squared distribution with favorable asymptotics. We present empirical results in which we use the test to analyze the roles of various oncogenic and suppressor pathways in tumor progression.
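    The categorical formula is elided in this abstract, so the following is only a hedged sketch of one form it can take. Assuming categories are compared with the discrete 0/1 metric, the population squared distance covariance reduces to the sum over cells of (π_ij − π_i+ π_+j)², where π_ij is the joint probability and π_i+, π_+j are the marginals. The function name and the table-of-counts input are illustrative assumptions, not the dissertation's notation or the INeS API.

```python
from math import sqrt


def categorical_dcor_squared(table):
    """Plug-in squared distance correlation from a contingency table of counts.

    Assumes the discrete 0/1 metric on categories (an assumption of this
    sketch). `table` is a rectangular list of rows of non-negative counts.
    """
    n = sum(sum(row) for row in table)
    if n <= 0:
        raise ValueError("table must contain at least one observation")

    # Estimated joint and marginal probabilities.
    joint = [[count / n for count in row] for row in table]
    row_p = [sum(row) for row in joint]
    col_p = [sum(col) for col in zip(*joint)]

    # Squared distance covariance: sum of squared deviations from independence.
    dcov2 = sum(
        (joint[i][j] - row_p[i] * col_p[j]) ** 2
        for i in range(len(row_p))
        for j in range(len(col_p))
    )

    def dvar2(p):
        # dCov^2(X, X) under the same metric: sum p^2 - 2 sum p^3 + (sum p^2)^2.
        s2 = sum(x ** 2 for x in p)
        s3 = sum(x ** 3 for x in p)
        return max(0.0, s2 - 2 * s3 + s2 ** 2)

    denom = sqrt(dvar2(row_p) * dvar2(col_p))
    # A constant variable has zero distance variance; report 0 in that case.
    return dcov2 / denom if denom > 0 else 0.0
```

    Under these assumptions an exactly independent table gives 0 and a table with perfect one-to-one association gives 1, which makes the measure usable as an edge score when comparing candidate network structures.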

    Genome-scale MicroRNA target prediction through clustering with Dirichlet process mixture model

    Background: MicroRNA regulation is fundamentally responsible for fine-tuning the whole gene network in humans and has been implicated in most physiological and pathological conditions. Studying the regulatory impact of microRNAs on various cellular and disease processes has resulted in numerous computational tools that investigate microRNA-mRNA interactions by predicting static binding sites, which depend heavily on sequence pairing. However, the practical use of such target prediction has been hindered by the interplay between competing and cooperative microRNA binding, which greatly complicates the regulatory process. Results: We developed a new method for improved microRNA target prediction based on a Dirichlet Process Gaussian Mixture Model (DPGMM), using a large collection of molecular features associated with microRNAs, mRNAs, and the interaction sites. Multiple validations based on microRNA-mRNA interactions reported in recent large-scale sequencing analyses, together with a screening test on the entire human transcriptome, show that our model outperformed several state-of-the-art tools in predictive power for binding sites specific to transcript isoforms, with fewer false-positive predictions. Last, we illustrated the use of predicted targets in constructing conditional microRNA-mediated gene regulation networks in human cancer. Conclusion: The probability-based binding site prediction provides not only a useful tool for differentiating microRNA targets according to their estimated binding potential but also a capability that is highly important for exploring dynamic regulation where binding competition is involved.
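    As generic background on Dirichlet process mixtures, and not a description of the authors' model, the mixture weights of a Dirichlet process can be drawn with a truncated stick-breaking construction: each weight takes a Beta(1, α) fraction of the stick that remains. The function name, the truncation level, and the choice to assign the leftover mass to the last component are assumptions of this sketch.

```python
import random


def stick_breaking_weights(alpha, truncation, seed=None):
    """Draw mixture weights from a truncated stick-breaking process.

    alpha: concentration parameter (> 0); larger values spread mass over
           more components.
    truncation: number of components kept (>= 1); the final component
                receives whatever stick length is left, so weights sum to 1.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if truncation < 1:
        raise ValueError("truncation must be at least 1")

    rng = random.Random(seed)
    weights = []
    remaining = 1.0
    for _ in range(truncation - 1):
        fraction = rng.betavariate(1.0, alpha)  # share of the remaining stick
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)
    return weights
```

    In a Gaussian mixture built on such weights, the number of components that receive appreciable mass is inferred from the data rather than fixed in advance, which is the usual reason for choosing a Dirichlet process prior for clustering.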

    Seventh Biennial Report : June 2003 - March 2005

    No full text

    Mitmekesiste bioloogiliste andmete ühendamine ja analüüs (Combining and analysing diverse biological data)

    The electronic version of the dissertation does not include the publications. Thanks to advances in technology, the volume of biological data has multiplied in recent years. These data cover different areas of biology. With only a single data set, biological processes or diseases can be studied from only one aspect at a time. There is therefore a growing need for machine learning methods that help combine data from different domains in order to study biological processes as a whole. There is also demand for reliable disease-specific data collections that would allow such analyses to be carried out more efficiently. This dissertation describes how to apply machine-learning-based integration methods to study a range of biological questions. We show how analysis based on integrated data enables a better understanding of biological processes in three domains: Alzheimer's disease, toxicology, and immunology. Alzheimer's disease is an age-related neurodegenerative disease with no effective treatment. In the dissertation we show how to integrate different Alzheimer's-disease-specific data sets to form HENA, a heterogeneous graph-based Alzheimer's-specific data set. We then demonstrate the application of a deep learning method, the graph convolutional neural network, to HENA to find genes potentially associated with the disease. Second, we studied psoriasis, a chronic immune-mediated inflammatory disease. To do so, we combined laboratory measurements from patients' blood and skin with clinical information and integrated the results of the corresponding analyses on the basis of domain-specific knowledge. The final part of the work focuses on advancing toxicity testing strategies. Toxicity testing is the process of assessing whether the chemicals under study have harmful effects on an organism. It is needed, for example, in assessing the safety of drugs. In this work we identified groups of toxic compounds with similar mechanisms of action. In addition, we developed a classification model that makes it possible to assess the toxicity of new compounds.
    Rapid advances in biotechnology and decreasing production costs have led to an explosion of experimental data being produced in laboratories around the world. Individual experiments allow us to understand biological processes, e.g. diseases, from different angles. However, to obtain a systematic view of a disease it is necessary to combine these heterogeneous data. The large amounts of diverse data require building machine learning models that can help, e.g., to identify which genes are related to a disease. Additionally, there is a need to compose reliable integrated data sets that researchers can work with effectively. In this thesis we demonstrate how to combine and analyze different types of biological data using the example of three biological domains: Alzheimer’s disease, immunology, and toxicology. More specifically, we combine data sets related to Alzheimer’s disease into a novel heterogeneous network-based data set for Alzheimer’s disease (HENA). We then apply graph convolutional networks, a state-of-the-art deep learning method, to a node classification task in HENA to find genes that are potentially associated with the disease. Combining patients’ data related to immune disease helps to uncover its pathological mechanisms and to find better treatments in the future. We analyse laboratory data from patients’ skin and blood samples by combining them with clinical information. Subsequently, we bring together the results of the individual analyses, using available domain knowledge, to form a more systematic view of the disease pathogenesis. Toxicity testing is the process of determining the harmful effects of substances on living organisms. One of its applications is the safety assessment of drugs or other chemicals for humans. In this work we identify groups of toxicants that have similar mechanisms of action. Additionally, we develop a classification model that allows us to assess the toxic actions of unknown compounds.
    https://www.ester.ee/record=b523255