30 research outputs found

    Oppimisen kohdentaminen easymmetrisen harvuuden avulla

    Get PDF
    Useat viime vuosina kerätyt havaintoaineistot koostuvat mittauksista hyvin pienestä määrästä näytteitä. Tällaisten aineistojen mallintaminen on haasteellista, koska mallit helposti ylisovittuvat aineistoon. Ongelmaan on kehitetty useita lähestymistapoja. Pääasiallisen mallinnustehtävän rinnalle voidaan ottaa muita mallinnustehtäviä, joissa käytettävät mallit kytketään pääasiallisen tehtävän malliin. Näin mallien yhteisten osien oppimiseen on käytettävissä enemmän aineistoa, mikä parantaa tulosten yleistymistä uusiin aineistoihin. Tätä lähestymistapaa kutsutaan monitehtäväoppimiseksi. Käytettävää mallia voidaan myös rajoittaa lisäämällä siihen oletuksia, jotka rajoittavat mallin sovittumista aineistoon ja siten vähentävät ylisovittumista. Tyypilliset monitehtäväoppimista hyödyntävät mallit painottavat kaikkia oppimistehtäviä yhtä voimakkaasti, vaikka yksi oppimistehtävä on yleensä muita tärkeämpi. Tämä diplomityö on esitutkimus uudesta lähestymistavasta, joka pyrkii monitehtäväoppimisasetelmassa parantamaan yleistyvyyttä yhdessä oppimistehtävässä eri mallien sovittumiskykyä rajoittavien oletusten avulla. Valitussa oppimistehtävässä mallin sovittumista aineistoon rajoitetaan muita oppimistehtäviä enemmän mallin harvuutta lisäämällä, jotta tehtävälle opittu malli yleistyisi paremmin. Uutta lähestymistapaa tutkitaan rajaamalla tutkimuskysymys suosittuihin LDAmalleihin, joissa hyödynnetään bayesilaisia epäparametrisia priorijakaumia. Epäsymmetrisen harvuuden vaikutuksia tutkitaan tämän malliperheen avulla. Tuloksissa on havaittavissa hienovaraisia parannuksia yleistyvyyteen. Tulokset uudella mallilla ovat kilpailukykyisiä tämän hetkisten johtavien menetelmien tulosten kanssa.Modern data sets often suffer from the problem of having measurements from very few samples. The small sample size makes modeling such data sets very difficult, as models easily overfit to the data. Many approaches to alleviate the problem have been taken. One such approach is multi-task learning, a subfield of statistical machine learning, in which multiple data sets are modeled simultaneously. More generally, multiple learning tasks may be learnt simultaneously to achieve better performance in each. Another approach to the problem of having too few samples is to prevent over fitting by constraining the model by making suitable assumptions. Traditional multi-task methods treat all learning tasks and data sets equally, even thought we are usually mostly interested in learning one of them. This thesis is a case study about promoting predictive performance in a specific data set of interest in a multi-task setting by constraining the models for the learning tasks unevenly. The model for the data set of interest more sparse as compared to the models for the secondary data sets. To study the new approach, the research question is limited to the very specific and popular family of so-called topic models using Bayesian nonparametric priors. A new model is presented which enables us to study the effects of asymmetric sparsity. The effects of asymmetric sparsity are studied by using the new model on real data and toy data. Subtle beneficial effects of asymmetric sparsity are observed on toy data and the new model performs comparably to existing state-of-the-art methods on real data

    Cluster analysis to estimate the risk of preeclampsia in the high-risk Prediction and Prevention of Preeclampsia and Intrauterine Growth Restriction (PREDO) study

    Get PDF
    Objectives Preeclampsia is divided into early-onset (delivery before 34 weeks of gestation) and late-onset (delivery at or after 34 weeks) subtypes, which may rise from different etiopathogenic backgrounds. Early-onset disease is associated with placental dysfunction. Late-onset disease develops predominantly due to metabolic disturbances, obesity, diabetes, lipid dysfunction, and inflammation, which affect endothelial function. Our aim was to use cluster analysis to investigate clinical factors predicting the onset and severity of preeclampsia in a cohort of women with known clinical risk factors. Methods We recruited 903 pregnant women with risk factors for preeclampsia at gestational weeks 12(+0)-13(+6). Each individual outcome diagnosis was independently verified from medical records. We applied a Bayesian clustering algorithm to classify the study participants to clusters based on their particular risk factor combination. For each cluster, we computed the risk ratio of each disease outcome, relative to the risk in the general population. Results The risk of preeclampsia increased exponentially with respect to the number of risk factors. Our analysis revealed 25 number of clusters. Preeclampsia in a previous pregnancy (n = 138) increased the risk of preeclampsia 8.1 fold (95% confidence interval (CI) 5.711.2) compared to a general population of pregnant women. Having a small for gestational age infant (n = 57) in a previous pregnancy increased the risk of early-onset preeclampsia 17.5 fold (95%CI 2.160.5). Cluster of those two risk factors together (n = 21) increased the risk of severe preeclampsia to 23.8-fold (95%CI 5.160.6), intermediate onset (delivery between 34(+0)-36(+6) weeks of gestation) to 25.1-fold (95%CI 3.179.9) and preterm preeclampsia (delivery before 37(+0) weeks of gestation) to 16.4-fold (95%CI 2.052.4). Body mass index over 30 kg/m(2) (n = 228) as a sole risk factor increased the risk of preeclampsia to 2.1-fold (95%CI 1.13.6). Together with preeclampsia in an earlier pregnancy the risk increased to 11.4 (95%CI 4.520.9). Chronic hypertension (n = 60) increased the risk of preeclampsia 5.3-fold (95%CI 2.49.8), of severe preeclampsia 22.2-fold (95%CI 9.941.0), and risk of early-onset preeclampsia 16.7-fold (95%CI 2.057.6). If a woman had chronic hypertension combined with obesity, gestational diabetes and earlier preeclampsia, the risk of term preeclampsia increased 4.8-fold (95%CI 0.121.7). Women with type 1 diabetes mellitus had a high risk of all subgroups of preeclampsia. Conclusion The risk of preeclampsia increases exponentially with respect to the number of risk factors. Early-onset preeclampsia and severe preeclampsia have different risk profile from term preeclampsia.Peer reviewe

    Modelling G×E with historical weather information improves genomic prediction in new environments

    No full text
    MOTIVATION: Interaction between the genotype and the environment (G×E) has a strong impact on the yield of major crop plants. Although influential, taking G×E explicitly into account in plant breeding has remained difficult. Recently G×E has been predicted from environmental and genomic covariates, but existing works have not shown that generalization to new environments and years without access to in-season data is possible and practical applicability remains unclear. Using data from a Barley breeding programme in Finland, we construct an in silico experiment to study the viability of G×E prediction under practical constraints. RESULTS: We show that the response to the environment of a new generation of untested Barley cultivars can be predicted in new locations and years using genomic data, machine learning and historical weather observations for the new locations. Our results highlight the need for models of G×E: non-linear effects clearly dominate linear ones, and the interaction between the soil type anddaily rain is identified as the main driver for G×E for Barley in Finland. Our study implies that genomic selection can be used to capture the yield potential in G×E effects for future growth seasons, providing a possible means to achieve yield improvements, needed for feeding the growing population. AVAILABILITY AND IMPLEMENTATION: The data accompanied by the method code (http://research.cs.aalto.fi/pml/software/gxe/bioinformatics_codes.zip) is available in the form of kernels to allow reproducing the results. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.Peer reviewe

    Transfer learning using a nonparametric sparse topic model

    Get PDF
    In many domains data items are represented by vectors of counts: count data arises, for example, in bioinformatics or analysis of text documents represented as word count vectors. However, often the amount of data available from an interesting data source is too small to model the data source well. When several data sets are available from related sources, exploiting their similarities by transfer learning can improve the resulting models compared to modeling sources independently. We introduce a Bayesian generative transfer learning model which represents similarity across document collections by sparse sharing of latent topics controlled by an Indian buffet process. Unlike a prominent previous model, hierarchical Dirichlet process (HDP) based multi-task learning, our model decouples topic sharing probability from topic strength, making sharing of low-strength topics easier. In experiments, our model outperforms the HDP approach both on synthetic data and in first of the two case studies on text collections, and achieves similar performance as the HDP approach in the second case study.acceptedVersionPeer reviewe

    Genome-wide association studies with high-dimensional phenotypes

    No full text
    High-dimensional phenotypes hold promise for richer findings in association studies, but testing of several phenotype traits aggravates the grand challenge of association studies, that of multiple testing. Several methods have recently been proposed for testing jointly all traits in a high-dimensional vector of phenotypes, with prospect of increased power to detect small effects that would be missed if tested individually. However, the methods have rarely been compared to the extent of enabling assessment of their relative merits and setting up guidelines on which method to use, and how to use it. We compare the methods on simulated data and with a real metabolomics data set comprising 137 highly correlated variables and approximately 550,000 SNPs. Applying the methods to genome-wide data with hundreds of thousands of markers inevitably requires division of the problem into manageable parts facilitating parallel processing, parts corresponding to individual genetic variants, pathways, or genes, for example. Here we utilize a straightforward formulation according to which the genome is divided into blocks of nearby correlated genetic markers, tested jointly for association with the phenotypes. This formulation is computationally feasible, reduces the number of tests, and lets the methods take advantage of combining information over several correlated variables not only on the phenotype side, but also on the genotype side. Our experiments show that canonical correlation analysis has higher power than alternative methods, while remaining computationally tractable for routine use in the GWAS setting, provided the number of samples is sufficient compared to the numbers of phenotype and genotype variables tested. Sparse canonical correlation analysis and regression models with latent confounding factors show promising performance when the number of samples is small.Comment: 33 pages, 11 figure

    Novel indicators of oat quality

    No full text

    Using the five to fifteen-collateral informant questionnaire for retrospective assessment of childhood symptoms in adults with and without autism or ADHD

    No full text
    Due to lack of previous studies, we aimed at evaluating the use of the Five to Fifteen (FTF) questionnaire in adults with neurodevelopmental disorders (NDD) and in controls without NDD. The NDD group consisted of adults with autism spectrum disorder ASD (n = 183) or attention-deficit/hyperactivity disorder (ADHD) (n = 174) without intellectual disability, recruited from a tertiary outpatient clinic. A web survey was used to collect data from general population adult control group without NDD (n = 738). The participants were retrospectively rated by their parents regarding childhood symptoms, using five to fifteen-collateral informant questionnaire (FTF-CIQ). Adults with NDD had higher FTF-CIQ domain and subdomain scores than controls, and displayed similar test profiles as children with corresponding diagnosis in previous studies. Based on the FTF-CIQ domain scores, 84.2% of the study participants (93% of the controls; 64% of the adults with NDD) were correctly classified in a logistic regression analysis. Likewise, Receiver Operating Characteristic (ROC) curve analysis on FTF-CIQ total sum score indicated that a cut-off value of 20.50 correctly classified 90% of the controls and 67% of the clinical cases, whilst a cut-off value of 30.50 correctly classified 84% of the controls and 77% of the clinical cases. The factor analysis revealed three underlying components: learning difficulties, cognitive and executive functions; social skills and emotional/behavioural symptoms; as well as motor and perceptual skills. Whilst not designed as a diagnostic instrument, the FTF-CIQ may be useful for providing information on childhood symptoms and associated difficulties in individuals assessed for NDD as adults
    corecore