
    Preparation of high-dimensional biomedical data with a focus on prediction and error estimation


    Machine-learning methods applied to integrated transcriptomic data from bovine blastocysts and elongating conceptuses to identify genes predictive of embryonic competence

    Early pregnancy loss markedly impacts reproductive efficiency in cattle. The objectives were to model a biologically relevant gene signature predicting embryonic competence for survival after integrating transcriptomic data from blastocysts and elongating conceptuses with different developmental capacities, and to validate the potential biomarkers with independent embryonic data sets through the application of machine-learning algorithms. First, two data sets from in vivo-produced blastocysts competent or not to sustain a pregnancy were integrated with a data set from long and short day-15 conceptuses. A statistical contrast determined differentially expressed genes (DEG) increasing in expression from a competent blastocyst to a long conceptus and vice versa; these were enriched for KEGG pathways related to glycolysis/gluconeogenesis and RNA processing, respectively. Next, the most discriminative DEG between blastocysts that did or did not result in pregnancy were selected by linear discriminant analysis. These eight putative biomarker genes were validated by modeling their expression in competent or noncompetent blastocysts through Bayesian logistic regression or neural networks and predicting embryo developmental fate in four external data sets consisting of in vitro-produced blastocysts (i) competent or not, or (ii) exposed or not to detrimental conditions during culture, and elongated conceptuses (iii) of different length, or (iv) developed in the uteri of high- or subfertile heifers. Predictions for each data set were more than 85% accurate, suggesting that these genes play a key role in embryo development and pregnancy establishment. In conclusion, this study integrated transcriptomic data from seven independent experiments to identify a small set of genes capable of predicting embryonic competence for survival.
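    The selection and modelling steps described above can be illustrated with a small, hedged sketch in R. It is not the authors' pipeline: the simulated matrix expr, the outcome factor competent, and the use of MASS::lda plus a plain logistic regression (rather than Bayesian logistic regression or neural networks) are assumptions made only for illustration.

```r
# Hypothetical sketch, not the published pipeline: rank genes by their weight
# in a linear discriminant analysis and use the top eight in a logistic
# regression classifier of embryo competence. All data are simulated.
library(MASS)

set.seed(42)
n_samples <- 40; n_genes <- 30
expr <- matrix(rnorm(n_samples * n_genes), nrow = n_samples,
               dimnames = list(NULL, paste0("gene", seq_len(n_genes))))
competent <- factor(rep(c("yes", "no"), each = n_samples / 2))
expr[competent == "yes", 1:8] <- expr[competent == "yes", 1:8] + 1  # planted signal

# Rank genes by the absolute value of their LDA coefficient and keep eight.
fit_lda <- lda(expr, grouping = competent)
top8    <- names(sort(abs(fit_lda$scaling[, 1]), decreasing = TRUE))[1:8]

# Fit a logistic regression on the selected genes and predict competence.
train_df <- data.frame(expr[, top8], y = competent)
fit_glm  <- glm(y ~ ., data = train_df, family = binomial)
head(predict(fit_glm, newdata = train_df, type = "response"))
```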

    Making Complex Prediction Rules Applicable for Readers: Current Practice in Random Forest Literature and Recommendations

    Ideally, prediction rules (including classifiers as a special case) should be published in such a way that readers may apply them, for example to make predictions for their own data. While this is straightforward for simple prediction rules, such as those based on the logistic regression model, it is much more difficult for complex prediction rules derived by machine learning tools. We conducted a survey of articles reporting prediction rules that were constructed using the random forest algorithm and published in PLOS ONE in 2014-2015, with the aim of identifying issues related to their applicability. The presented prediction rules were applicable in only 2 of the 30 identified papers, while for a further 8 prediction rules it was possible to obtain the necessary information by contacting the authors. Various problems, such as non-response of the authors, hampered the applicability of the prediction rules in the other cases. Based on our experiences from the survey, we formulate a set of recommendations for authors publishing complex prediction rules to ensure their applicability for readers.
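    One plausible way to make a random-forest rule directly applicable to readers is to publish the fitted model object itself together with the exact predictor names, so that a reader can load and apply it to their own data. The sketch below is only an illustration with simulated data; the file name prediction_rule_rf.rds and the use of the randomForest package with saveRDS/readRDS are assumptions, not a description of what the surveyed papers did.

```r
# Hypothetical sketch: fit a random forest, store the fitted object plus the
# predictor names, and show how a reader would re-apply the rule to new data.
library(randomForest)

set.seed(1)
train <- data.frame(matrix(rnorm(100 * 5), ncol = 5))
names(train) <- paste0("marker", 1:5)
train$outcome <- factor(rbinom(100, 1, plogis(train$marker1)))

rf_fit <- randomForest(outcome ~ ., data = train, ntree = 500)
saveRDS(list(model = rf_fit, predictors = setdiff(names(train), "outcome")),
        file = "prediction_rule_rf.rds")

# A reader applying the published rule to their own data:
rule     <- readRDS("prediction_rule_rf.rds")
new_data <- train[1:3, rule$predictors]          # stand-in for the reader's data
predict(rule$model, newdata = new_data, type = "prob")
```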

    Improving cross-study prediction through addon batch effect adjustment or addon normalization

    Motivation: To date, most medical tests derived by applying classification methods to high-dimensional molecular data are hardly used in clinical practice. This is partly because the prediction error resulting when applying them to external data is usually much higher than the internal error as evaluated through within-study validation procedures. We suggest the use of addon normalization and addon batch effect removal techniques in this context to reduce systematic differences between external data and the original dataset, with the aim of improving prediction performance. Results: We evaluate the impact of addon normalization and seven batch effect removal methods on cross-study prediction performance for several common classifiers using a large collection of microarray gene expression datasets, showing that some of these techniques reduce prediction error. Availability and Implementation: All investigated addon methods are implemented in our R package bapred.
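    The addon idea can be made concrete with a small sketch: normalization parameters are estimated from the training study only and then re-applied, unchanged, to each new external sample, so that the external data are pulled onto the training scale before the already trained classifier is used. The code below is a conceptual illustration of addon quantile normalization in base R, not the bapred implementation; the object names and simulated data are assumptions.

```r
# Conceptual sketch of addon quantile normalization (not the bapred API):
# the reference distribution is learned on the training data only, then each
# external sample is mapped onto that reference via its ranks.

set.seed(1)
train_expr    <- matrix(rnorm(1000 * 20, mean = 6), nrow = 1000)  # genes x samples
external_expr <- matrix(rnorm(1000 * 5,  mean = 7), nrow = 1000)  # shifted batch

# "Training" step: reference distribution = mean of the sorted training columns.
reference <- rowMeans(apply(train_expr, 2, sort))

# Addon step: map each external sample onto the stored reference quantiles.
addon_quantile <- function(x, reference) reference[rank(x, ties.method = "first")]
external_norm  <- apply(external_expr, 2, addon_quantile, reference = reference)

summary(colMeans(external_expr))  # before: systematically shifted
summary(colMeans(external_norm))  # after: aligned with the training scale
```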

    Fine spatial scale modelling of Trentino past forest landscape and future change scenarios to study ecosystem services through the years

    The landscape in Europe has changed dramatically over the last decades. This has been especially true for Alpine regions, where the progressive urbanization of the valleys has been accompanied by the abandonment of smaller villages and areas at higher elevation. This trend has been clearly observable in the Provincia Autonoma di Trento (PAT) region in the Italian Alps. The impact has been substantial for many rural areas, with the progressive shrinking of meadows and pastures due to natural forest recolonization. These modifications of the landscape affect biodiversity, social and cultural dynamics, including landscape perception, and some ecosystem services. A literature review showed that this topic has been addressed by several authors across the Alps, but their studies are limited in spatial coverage, spatial resolution and time span. This thesis aims to create a comprehensive dataset of historical maps and multitemporal orthophotos for the PAT area, to analyse these data to identify changes in forest and open areas, to evaluate how these changes affected landscape structure and ecosystems, to create a future change scenario for a test area, and to highlight some major changes in ecosystem services through time.
    In this study a high-resolution dataset of maps covering the whole PAT area for over a century was developed. The earliest representation of the PAT territory containing reliable data about forest coverage was considered to be the Historic Cadastral maps of 1859. These maps systematically and accurately represented the land use of each parcel in the Habsburg Empire, including the PAT. The Italian Kingdom Forest Maps were the next important source of information about forest coverage after World War I, followed by the most recent datasets: the greyscale images of 1954 and 1994 and the multiband images of 2006 and 2015. The purpose of the dataset development is twofold: on the one hand, to create a series of maps describing forest and open-area coverage over the last 160 years for the whole PAT; on the other, to set up and test procedures to extract the relevant information from imagery and historical maps. The datasets were archived, processed and analysed using the Free and Open Source Software (FOSS) GIS packages GRASS and QGIS, and R. The goal set by this work was achieved by a remote-sensing analysis of these maps and aerial imagery. A series of procedures was applied to extract a land use map, with the forest categories reaching a level of detail rarely achieved for a study area of this extent (6,200 km²). The resolution of the original maps is at the metre level, whereas the coarsest resampling adopted is 10 m × 10 m pixels. The great variety and size of the input data required the development, alongside the main part of the research, of a series of new tools for automating the analysis of the aerial imagery, to reduce user intervention. New tools for historic map classification were also developed, to remove symbols (e.g. signs) from the resulting land use maps, thus enhancing the results. Once the multitemporal forest maps were obtained, the second phase of the work was a qualitative and quantitative assessment of forest coverage and how it changed. This was performed by evaluating a number of landscape metrics, indexes used to quantify the compaction or rarefaction of forest areas.
    A recurring issue in the current literature on landscape metrics was identified during this analysis and studied extensively. It highlighted the importance of specifying certain parameters in the most widely used landscape fragmentation analysis software to make the results of different studies properly comparable. Within this analysis, data from other maps were used to characterize the afforestation process in PAT: the potential forest maps, used to quantify the area of potential forest actually afforested through the years; the Digital Elevation Model, used to quantify changes in forest area at different altitude ranges; and the forest class map, used to estimate how afforestation affected each forest type. The output forest maps were used to analyse and estimate some ecosystem services, in particular protection from soil erosion, changes in biodiversity, and the landscape of the forests. Finally, a procedure for the analysis of future change scenarios was set up to study how afforestation will proceed, in the absence of external factors, in a protected area of PAT. The procedure was developed using Agent Based Models, which treat trees as thinking agents able to choose where to expand the forest area.
    The first part of the results consists of a temporal series of maps representing the state of the forest in each year of the considered dataset. The analysis of these maps suggests a trend of afforestation across the PAT territory. The forest maps were then reclassified by altitude range and forest type to show how afforestation proceeded at different altitudes and in different forest types; the results showed that forest expansion acted homogeneously across altitude ranges and forest types. The analysis of a selected set of landscape metrics showed a progressive compaction of the forests at the expense of open areas, in each altitude range and for each forest type. This benefited all those ecosystem services linked to high forest cover, while reducing ecotonal habitats and affecting biodiversity distribution and quality. Finally, the ABM procedure produced a set of maps representing a possible evolution of the forest in an area of PAT, depicting a situation similar to that obtained by other simulations developed with different models in the same area. A second part of the results consists of new open source tools for image analysis, developed to achieve the results shown but with a potentially wider field of application, along with a new procedure for the evaluation of image classification. The work fulfilled its aims, while providing new tools and enhancements of existing tools for remote sensing, and leaving as a legacy a large dataset that will be used to deepen the knowledge of the PAT territory and, more widely, to study emerging patterns of afforestation in an alpine environment.
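    As an illustration of the altitude-range summaries mentioned above, the sketch below computes forest cover per elevation band from two aligned binary forest rasters and a DEM held as plain matrices. It is not taken from the thesis: the simulated layers, the band limits and the use of base R (rather than GRASS, QGIS or dedicated raster packages) are assumptions made only to show the kind of calculation involved.

```r
# Hypothetical sketch: forest cover (%) per altitude band for two dates,
# using simulated matrices in place of the real georeferenced layers.

set.seed(1)
dem         <- matrix(runif(100 * 100, 200, 2200), nrow = 100)  # elevation in m
forest_1859 <- matrix(rbinom(100 * 100, 1, 0.45), nrow = 100)   # 1 = forest
forest_2015 <- matrix(rbinom(100 * 100, 1, 0.70), nrow = 100)

# Group cells into altitude bands and compute forest cover (%) per band.
bands <- cut(dem, breaks = c(200, 700, 1200, 1700, 2200), include.lowest = TRUE,
             labels = c("200-700", "700-1200", "1200-1700", "1700-2200"))
cover <- function(forest) tapply(forest, bands, mean) * 100

data.frame(band       = levels(bands),
           cover_1859 = round(cover(forest_1859), 1),
           cover_2015 = round(cover(forest_2015), 1))
```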

    Parallel Computing for Biological Data

    In the 1990s a number of technological innovations appeared that revolutionized biology, and 'Bioinformatics' became a new scientific discipline. Microarrays can measure the abundance of tens of thousands of mRNA species, data on the complete genomic sequences of many different organisms are available, and other technologies make it possible to study various processes at the molecular level. In Bioinformatics and Biostatistics, current research and computations are limited by the available computer hardware. However, this problem can be solved using high-performance computing resources. There are several reasons for the increased focus on high-performance computing: larger data sets, increased computational requirements stemming from more sophisticated methodologies, and the latest developments in computer chip production. The open-source programming language 'R' was developed to provide a powerful and extensible environment for statistical and graphical techniques. There are many good reasons for preferring R to other software or programming languages for scientific computations in statistics and biology. However, the development of the R language was not aimed at providing software for parallel or high-performance computing. Nonetheless, during the last decade, a great deal of research has been conducted on using parallel computing techniques with R. This PhD thesis demonstrates the usefulness of the R language and parallel computing for biological research. It introduces parallel computing with R, and reviews and evaluates existing techniques and R packages for parallel computing on computer clusters, on multi-core systems, and in grid computing. From a computer science point of view, the packages were examined with respect to their reusability in biological applications, and some upgrades were proposed. Furthermore, parallel applications for next-generation sequence data and for preprocessing of microarray data were developed. Microarray data are characterized by high levels of noise and bias. As these perturbations have to be removed, preprocessing of raw data has been a research topic of high priority over the past few years. A new Bioconductor package called affyPara for parallelized preprocessing of high-density oligonucleotide microarray data was developed and published. The partition of the data can be performed over arrays using a block-cyclic partition, which makes parallelization of the algorithms directly possible. Existing statistical algorithms and data structures had to be adjusted and reformulated for use in parallel computing. Using the new parallel infrastructure, normalization methods could be enhanced and new methods became available. Partitioning the data and distributing it to several nodes or processors solves the main memory problem and accelerates the methods by up to a factor of fifteen for 300 arrays or more. The final part of the thesis contains a large cancer study analysing more than 7000 microarrays from a publicly available database and estimating gene interaction networks. For this purpose, a new R package for microarray data management was developed, and various challenges regarding the analysis of this amount of data are discussed. The comparison of gene networks for different pathways and different cancer entities in this large data collection partly confirms already established forms of gene interaction.
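    The chunk-wise strategy described above can be sketched with base R's parallel package rather than affyPara itself: the arrays, i.e. the columns of an intensity matrix, are split into blocks, each block is preprocessed on its own worker process, and the results are recombined. The per-array median centring used here is only a stand-in for the actual preprocessing routines, and the simple contiguous split replaces the block-cyclic partition mentioned in the text; all names are illustrative.

```r
# Hypothetical sketch: distribute per-array preprocessing of an intensity
# matrix over a small cluster of worker processes with the 'parallel' package.
library(parallel)

set.seed(1)
intensities <- matrix(rexp(5000 * 64), nrow = 5000)       # probes x arrays

cl     <- makeCluster(4)                                   # 4 worker processes
blocks <- split(seq_len(ncol(intensities)),                # contiguous blocks of arrays
                cut(seq_len(ncol(intensities)), 4, labels = FALSE))

preprocess_block <- function(cols, x) {
  block <- x[, cols, drop = FALSE]
  sweep(block, 2, apply(block, 2, median), "-")            # per-array median centring
}

result_blocks <- parLapply(cl, blocks, preprocess_block, x = intensities)
normalized    <- do.call(cbind, result_blocks)
stopCluster(cl)

dim(normalized)   # 5000 x 64, same layout as the input
```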

    Zero-Sum Regression: Scale Invariant Molecular Data Analysis

    In biomedicine, it is still an outstanding issue that the absolute scale of omics data gets lost due to technical limitations. As a consequence, the original scale first has to be approximated by normalization techniques before analysis methods can be applied. However, there are competing normalization strategies based on different assumptions about the structure of the underlying data. Because of these different assumptions, normalization methods can yield different results, which can in turn affect the outcome of subsequent analysis methods. An alternative concept is therefore to resolve this issue by using scale invariant data analysis methods. This thesis shows how generalized linear regression methods can be made scale invariant for log-transformed omics data by enforcing an additional constraint called the zero-sum constraint. To this end, an efficient coordinate descent algorithm is developed, and the advantages of this approach are demonstrated in simulations and on omics data. The corresponding open source software zeroSum is available at https://github.com/rehbergT/zeroSum
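    The scale-invariance property itself can be verified in a few lines of R. The sketch below is not the zeroSum package (which fits such constrained models by coordinate descent); it merely checks, on simulated data, that a coefficient vector summing to zero makes the linear predictor insensitive to per-sample shifts of the log-transformed data, i.e. to rescaling of the raw data. All names and values are illustrative.

```r
# Hypothetical check of the zero-sum invariance: adding a sample-specific
# constant gamma_i to every feature of sample i (equivalent to rescaling the
# raw, unlogged sample by exp(gamma_i)) leaves X %*% beta unchanged when
# the coefficients sum to zero.

set.seed(1)
n <- 10; p <- 6
X    <- matrix(rnorm(n * p), nrow = n)          # log-transformed omics data
beta <- c(1.2, -0.5, 0.3, -0.8, 0.6, -0.8)      # coefficients, sum(beta) == 0
sum(beta)

gamma     <- rnorm(n)                            # unknown per-sample scale factors
X_shifted <- X + gamma                           # gamma[i] added to every feature of sample i

eta_original <- X %*% beta
eta_shifted  <- X_shifted %*% beta
all.equal(eta_original, eta_shifted)             # TRUE: predictions are unaffected
```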