11 research outputs found

    Transfer learning with group factor analysis (Siirto-oppimista ryhmäfaktorianalyysilla)

    Modern measuring techniques allow us to collect more and more data in less time and at lower cost. When analyzing data, one sample might be the gene expression of a cell or the activity of a human brain at a certain time, consisting of tens of thousands of features. Often we have far fewer samples than features, and simple methods will overfit the data. Factor models are designed to model this kind of high-dimensional data via a lower-dimensional factor space. Factor analysis is the simplest factor model: it reconstructs each feature in the data as a weighted sum of the hidden factors (components). In this thesis I examine group factor analysis (GFA), which is an extension of factor analysis for multiple data sets. High-dimensional data can often be naturally divided into different groups (views), which GFA uses as prior information by inferring the component activities for views instead of single features. This property, combined with an automatic system for determining component activity, results in a powerful factor model. In this thesis, GFA is extended to explicitly model hidden relations between different data views. This is done by generating their component activity matrix in two alternative ways: as samples of a multivariate normal distribution and as a product of two low-rank matrices. Both extensions are solved via variational Bayesian inference, and are shown to model data with accuracy comparable to GFA. For data with many views, low-rank GFA is the most accurate model. Additionally, the problem of a small number of samples is addressed with two transfer learning setups: one being able to take advantage of background data with samples or features shared with target data, and the other introducing a novel transfer learning setup. It is shown, using both artificial and real data, that both of these setups allow us to form a better model when suitable background data is available.
The real data consists of drug response profiles measured on cell lines using two different microarray platforms.
    Finnish abstract (translated): Modern measurement techniques now yield more data for study in less time and at lower cost. When the object of study is, for example, the gene expression values of a cell or the activity of the human brain, a single sample can consist of tens of thousands of variables. There are often far fewer samples than variables, in which case simple methods overfit the data. Factor models are designed to model such high-dimensional data through a lower-dimensional factor space. Factor analysis is the simplest of these models: it reconstructs each variable of the data as a weighted sum of latent factors (components). This master's thesis applies and further develops group factor analysis (GFA), an extension of factor analysis to multiple data sets. High-dimensional data can often be divided into groups (views), which GFA takes into account by modelling component activities for groups rather than for individual variables. The model also includes a part that determines the relevance of the components. These properties make GFA a practical factor model. This work extends group factor analysis to model the relations between the views of the data explicitly. This is done by modelling the component activities of the views in two alternative ways: as samples from a multivariate normal distribution and as a product of two low-rank matrices. Both extensions are solved with variational Bayesian inference, and their accuracy in modelling data matches that of GFA. For data with many views, low-rank GFA is the most accurate model. The problem of a small number of samples is additionally addressed with two transfer learning methods. One uses background data that shares samples or variables with the target data. The other approach uses deeper-level transfer learning. The work shows, with both artificial and real data, that both methods improve the final model, provided that suitable background data is available. The real data consists of drug response measurements made on cell lines with microarrays.
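
    The factor model described in this abstract can be illustrated with a small generative sketch. It is not the thesis's implementation: the dimensions, the binary `activity` matrix, and the noise level are all hypothetical choices made for illustration, and no inference is performed. It only shows how each feature arises as a weighted sum of shared factors, with whole components switched on or off per view rather than per feature.

```python
import random

random.seed(0)

n_samples, n_factors = 6, 3
view_dims = [4, 2]  # hypothetical feature counts of two views

# Hypothetical view-by-component activity: 1 = component active in that view.
activity = [
    [1, 1, 0],
    [1, 0, 1],
]

# Latent factors shared by every view: one row per sample.
Z = [[random.gauss(0, 1) for _ in range(n_factors)] for _ in range(n_samples)]

loadings, views = [], []
for m, d in enumerate(view_dims):
    # Loading matrix of view m (d x n_factors); inactive components are zeroed.
    W = [[random.gauss(0, 1) * activity[m][k] for k in range(n_factors)]
         for _ in range(d)]
    # Each feature is a weighted sum of the factors plus a little noise.
    X = [[sum(z[k] * w[k] for k in range(n_factors)) + random.gauss(0, 0.1)
          for w in W] for z in Z]
    loadings.append(W)
    views.append(X)
```

    In this sketch the first component is shared by both views, while the second and third are each specific to one view. Recovering such a pattern from data alone is the kind of task the abstract attributes to GFA.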

    Distributed Bayesian Matrix Factorization with Limited Communication

    Bayesian matrix factorization (BMF) is a powerful tool for producing low-rank representations of matrices and for predicting missing values and providing confidence intervals. Scaling up the posterior inference for massive-scale matrices is challenging and requires distributing both data and computation over many workers, making communication the main computational bottleneck. Embarrassingly parallel inference would remove the communication needed, by using completely independent computations on different data subsets, but it suffers from the inherent unidentifiability of BMF solutions. We introduce a hierarchical decomposition of the joint posterior distribution, which couples the subset inferences, allowing for embarrassingly parallel computations in a sequence of at most three stages. Using an efficient approximate implementation, we show improvements empirically on both real and simulated data. Our distributed approach is able to achieve a speed-up of almost an order of magnitude over the full posterior, with a negligible effect on predictive accuracy. Our method outperforms state-of-the-art embarrassingly parallel MCMC methods in accuracy, and achieves results competitive with other available distributed and parallel implementations of BMF. Comment: 28 pages, 8 figures. The paper is published in Machine Learning journal. An implementation of the method is available in SMURFF software on github (bmfpp branch): https://github.com/ExaScience/smurf
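
    One well-known source of unidentifiability in matrix factorization is rotation of the latent space: two different factor pairs can reproduce exactly the same matrix. The sketch below illustrates that general fact with hypothetical small matrices. Whether this is the exact form of unidentifiability the paper addresses is an assumption here, and the code is not taken from SMURFF.

```python
import math
import random

random.seed(1)

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

rank = 2
U = [[random.gauss(0, 1) for _ in range(rank)] for _ in range(3)]  # row factors
V = [[random.gauss(0, 1) for _ in range(rank)] for _ in range(4)]  # column factors

theta = 0.7  # arbitrary rotation angle
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

# Original reconstruction U V^T and the one from rotated factors (U R)(V R)^T.
X1 = matmul(U, transpose(V))
X2 = matmul(matmul(U, R), transpose(matmul(V, R)))

# Largest element-wise difference between the two reconstructions.
max_diff = max(abs(a - b) for r1, r2 in zip(X1, X2) for a, b in zip(r1, r2))
```

    Because R times its transpose is the identity, `max_diff` is zero up to floating-point error even though the rotated factors differ from the originals. Workers that factorize data subsets independently may therefore land in differently rotated latent spaces, so their factors cannot simply be averaged.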

    Group Factor Analysis

    Factor analysis provides linear factors that describe relationships between individual variables of a data set. We extend this classical formulation into linear factors that describe relationships between groups of variables, where each group represents either a set of related variables or a data set. The model also naturally extends canonical correlation analysis to more than two sets, in a way that is more flexible than previous extensions. Our solution is formulated as variational inference of a latent variable model with structural sparsity, and it consists of two hierarchical levels: the higher level models the relationships between the groups, whereas the lower level models the observed variables given the higher level. We show that the resulting solution solves the group factor analysis problem accurately, outperforming alternative factor analysis based solutions as well as more straightforward implementations of group factor analysis. The method is demonstrated on two life science data sets, one on brain activation and the other on systems biology, illustrating its applicability to the analysis of different types of high-dimensional data sources.

    The relationship between electrophysiological and hemodynamic measures of neural activity varies across picture naming tasks: A multimodal magnetoencephalography-functional magnetic resonance imaging study

    Different neuroimaging methods can yield different views of task-dependent neural engagement. Studies examining the relationship between electromagnetic and hemodynamic measures have revealed correlated patterns across brain regions, but the role of the applied stimulation or experimental tasks in these correlation patterns is still poorly understood. Here, we evaluated the across-tasks variability of the MEG-fMRI relationship using data recorded during three distinct naming tasks (naming objects and actions from action images, and objects from object images), from the same set of participants. Our results demonstrate that the MEG-fMRI correlation pattern varies according to the performed task, and that this variability shows distinct spectral profiles across brain regions. Notably, analysis of the MEG data alone did not reveal modulations across the examined tasks in the time-frequency windows emerging from the MEG-fMRI correlation analysis. Our results suggest that the electromagnetic-hemodynamic correlation could serve as a more sensitive proxy for task-dependent neural engagement in cognitive tasks than isolated within-modality measures. Peer reviewed

    Bayesian multi-source models for drug response and brain imaging experiments (Bayesiläisiä monilähdemalleja lääkevaste- ja aivokuvantamiskokeisiin)

    This thesis investigates knowledge inference from measurements of multiple data sources, motivated by technologies in a wide range of domains that allow effective measurement of several related but heterogeneous data sources. In life sciences, examples of this kind of "multi-view" data are brain imaging data of multiple subjects along with a description of the experimental stimuli, as well as drug response studies including measurements of the expression level, copy number variation and mutation of genes in cell lines. Data analyses have typically concerned the structure of a single data source, or the effect of one data source on another. The multi-view data inspected in this thesis results in a more complex problem: besides the structure of each of the data sources, the relations between the data sources are of high interest as well. This thesis addresses modern multi-view data analysis problems using Bayesian latent variable models. They are a natural choice for developing models in order to gain knowledge about multiple data sources and their relations; they allow for missing values in the data, incorporating prior information into the modelling problem and estimating the uncertainty present in the inference. The key contributions of this thesis include formulating a low-rank data source relation model and presenting biclustering using sparse priors, as well as a relaxed formulation of tensor factorization. All the developed models have been published as open-source software, enabling widespread use and further development. The presented machine learning tools are demonstrated using drug response and brain imaging studies, for both of which predictive performance above state-of-the-art level is achieved. In the drug response studies, the models were able to accurately relate similar drugs, as well as detect known cancer genes affecting the responsiveness of cells to certain drugs.
In the brain response studies the benefits of the presented methods were shown via increased accuracy in predicting brain responses, whereas the relaxed tensor decomposition allowed for a novel way of utilizing measurements for multiple subjects. Finally, the advantage of using a low-dimensional latent space is illustrated in a genome-wide association study in an especially challenging domain: measurements exist for only two hundred subjects, yet there are some thousands of features for each subject. The study discovered a relevant gene associated with components of brain activity.
    Finnish abstract (translated): This work studies the acquisition of knowledge from multi-source data. In many fields it is now possible to efficiently collect measurements from several related but heterogeneous data sources. In the life sciences, examples of such multi-source data are brain imaging measurements of several subjects combined with a description of the stimulus used in the experiment, and drug response experiments that include measurements of gene expression, copy number and mutations in cell lines. In data analysis problems, the object of study has typically been either the structure of a single data source or the effect of one data source on another. The multi-source data examined in this work poses a more challenging problem, because in addition to the internal structure of each source, the relations between the sources are also of interest. In this work, multi-source data analysis problems are solved with Bayesian latent variable models. They are well suited to developing models for several data sources and the relations between them; they allow missing values in the data, and make it possible to take prior information into account in modelling and to assess uncertainty in model inference. As its main contributions, this work presents a low-dimensional relation model for data sources, demonstrates biclustering with sparse priors, and formulates a relaxed tensor factorization. All developed models have been published openly so that they can be developed further and used widely. The presented machine learning models were applied to drug response and brain imaging experiments. In both applications the previous state of the art in predictive accuracy was exceeded. In the drug response experiments, the models succeeded in associating similar drugs and detected known cancer genes that affected the sensitivity of cells to certain drugs. In the brain imaging experiments, the presented relaxed tensor factorization used the measurements of several subjects in a new way. In addition, this work showed the usefulness of a low-dimensional latent space in a genome-wide association study in an especially challenging experimental setting, where measurements are available from only two hundred subjects and the phenotype consists of thousands of features. With it, a relevant gene was found that explains components of brain activity.

    GFA: Exploratory Analysis of Multiple Data Sources with Group Factor Analysis Muhammad Ammad-ud-din

    The R package GFA provides a full pipeline for factor analysis of multiple data sources that are represented as matrices with co-occurring samples. It allows learning dependencies between subsets of the data sources, decomposed into latent factors. The package also implements sparse priors for the factorization, providing interpretable biclusters of the multi-source data.

    GFA

    The R package GFA provides a full pipeline for factor analysis of multiple data sources that are represented as matrices with co-occurring samples. It allows learning dependencies between subsets of the data sources, decomposed into latent factors. The package also implements sparse priors for the factorization, providing interpretable biclusters of the multi-source data. Peer reviewed
