Machine-learning methods applied to integrated transcriptomic data from bovine blastocysts and elongating conceptuses to identify genes predictive of embryonic competence
Early pregnancy loss markedly impacts reproductive efficiency in cattle. The objectives were to model a biologically relevant gene signature predicting embryonic competence for survival by integrating transcriptomic data from blastocysts and elongating conceptuses with different developmental capacities, and to validate the potential biomarkers on independent embryonic data sets through the application of machine-learning algorithms. First, two data sets from in vivo-produced blastocysts competent or not to sustain a pregnancy were integrated with a data set from long and short day-15 conceptuses. A statistical contrast determined differentially expressed genes (DEG) increasing in expression from a competent blastocyst to a long conceptus and vice versa; these were enriched for KEGG pathways related to glycolysis/gluconeogenesis and RNA processing, respectively. Next, the most discriminative DEG between blastocysts that did or did not result in a pregnancy were selected by linear discriminant analysis. These eight putative biomarker genes were validated by modeling their expression in competent or noncompetent blastocysts through Bayesian logistic regression or neural networks and predicting embryo developmental fate in four external data sets consisting of in vitro-produced blastocysts (i) competent or not, or (ii) exposed or not to detrimental conditions during culture, and elongated conceptuses (iii) of different length, or (iv) developed in the uteri of high- or subfertile heifers. Predictions for each data set were more than 85% accurate, suggesting that these genes play a key role in embryo development and pregnancy establishment. In conclusion, this study integrated transcriptomic data from seven independent experiments to identify a small set of genes capable of predicting embryonic competence for survival.
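The two-step pipeline described above, ranking genes by a discriminant criterion and then fitting a classifier on the selected biomarkers, can be sketched as follows. Everything here is illustrative: the synthetic data, the diagonal-LDA score used as a stand-in for the study's linear discriminant analysis, and the plain (non-Bayesian) logistic fit are all assumptions, not the study's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 60 blastocysts x 200 genes, where the first
# 5 genes differ in mean expression between competent (y=1) and
# noncompetent (y=0) embryos. None of this reproduces the study's data.
n, p, informative = 60, 200, 5
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, p))
X[:, :informative] += 2.0 * y[:, None]

# Step 1: rank genes by a diagonal-LDA discriminant score
# (standardized mean difference between the two classes).
m1, m0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
s = np.sqrt(X[y == 1].var(axis=0) + X[y == 0].var(axis=0) + 1e-9)
top = np.argsort(np.abs(m1 - m0) / s)[::-1][:8]   # 8 putative biomarkers

# Step 2: fit a plain logistic regression on the selected genes by
# gradient descent (a simple substitute for the Bayesian fit).
Z = X[:, top]
w, b = np.zeros(len(top)), 0.0
for _ in range(2000):
    pr = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
    w -= 0.1 * (Z.T @ (pr - y)) / n
    b -= 0.1 * (pr - y).mean()

pr = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
acc = ((pr > 0.5).astype(int) == y).mean()
print(f"selected genes: {sorted(top.tolist())}")
print(f"training accuracy: {acc:.2f}")
```

With a simulated effect this strong, the informative genes dominate the ranking and the downstream classifier separates the classes well; the study's external-validation step (predicting on four independent data sets) is what the toy omits.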
Making Complex Prediction Rules Applicable for Readers: Current Practice in Random Forest Literature and Recommendations
Ideally, prediction rules (including classifiers as a special case) should be published in such a way that readers may apply them, for example to make predictions for their own data. While this is straightforward for simple prediction rules, such as those based on the logistic regression model, it is much more difficult for complex prediction rules derived by machine learning tools. We conducted a survey of articles reporting prediction rules that were constructed using the random forest algorithm and published in PLOS ONE in 2014-2015, with the aim of identifying issues related to their applicability. The presented prediction rules were applicable in only 2 of the 30 identified papers, while for a further 8 prediction rules it was possible to obtain the necessary information by contacting the authors. Various problems, such as non-response of the authors, hampered the applicability of the prediction rules in the other cases. Based on our experiences from the survey, we formulate a set of recommendations for authors publishing complex prediction rules to ensure their applicability for readers.
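For a simple rule such as a logistic model, the kind of applicability the survey asks for can be illustrated by publishing the fitted parameters together with the exact feature names and their order. All names and numbers below are invented for the example; they are not taken from any surveyed paper.

```python
import json
import math

# Toy illustration of the recommendation: publish a prediction rule in a
# form readers can apply directly. For a logistic model this means shipping
# the exact feature names and order plus the fitted parameters.
# All names and values below are invented for the example.
rule = {
    "model": "logistic regression",
    "features": ["gene_A", "gene_B", "gene_C"],
    "intercept": -0.8,
    "coefficients": [1.2, -0.4, 0.7],
    "positive_class": "responder",
    "threshold": 0.5,
}
serialized = json.dumps(rule, indent=2)   # what would be published

def apply_rule(rule, x):
    """Apply a published rule to one sample given as {feature: value}."""
    z = rule["intercept"] + sum(
        c * x[f] for c, f in zip(rule["coefficients"], rule["features"]))
    return 1.0 / (1.0 + math.exp(-z))

prob = apply_rule(json.loads(serialized),
                  {"gene_A": 1.0, "gene_B": 0.5, "gene_C": 2.0})
print(round(prob, 3))  # -> 0.832
```

For a random forest the analogous requirement is to publish the fitted forest object itself, or code and data sufficient to reproduce it, which is exactly where most of the surveyed papers fell short.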
Improving cross-study prediction through addon batch effect adjustment or addon normalization
Motivation: To date, most medical tests derived by applying classification methods to high-dimensional molecular data are hardly used in clinical practice. This is partly because the prediction error resulting when applying them to external data is usually much higher than the internal error evaluated through within-study validation procedures. We suggest the use of addon normalization and addon batch effect removal techniques in this context to reduce systematic differences between external data and the original dataset, with the aim of improving prediction performance. Results: We evaluate the impact of addon normalization and seven batch effect removal methods on cross-study prediction performance for several common classifiers using a large collection of microarray gene expression datasets, showing that some of these techniques reduce prediction error. Availability and Implementation: All investigated addon methods are implemented in our R package bapred.
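The "addon" idea can be shown with a minimal sketch: every adjustment parameter is estimated on the training study only and then applied to the external study, so no external information leaks into the model. The mean-centering below is a deliberately simplified stand-in, not one of the actual bapred methods.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a training study (80 samples x 50 genes) and an
# external study measured with a systematic per-gene shift (batch effect).
train = rng.normal(size=(80, 50))
external = rng.normal(size=(40, 50)) + rng.normal(0, 2, size=50)

# Addon mean-centering: the reference means come from the training data
# only; the external batch is centred on its own means and shifted onto
# the training scale.
mu = train.mean(axis=0)
external_adj = external - external.mean(axis=0) + mu

gap_before = np.abs(external.mean(axis=0) - mu).mean()
gap_after = np.abs(external_adj.mean(axis=0) - mu).mean()
print(f"per-gene mean gap before: {gap_before:.2f}, after: {gap_after:.2f}")
```

After adjustment the external per-gene means coincide with the training means by construction; the methods evaluated in the paper apply the same "frozen parameters" principle to more sophisticated normalization and batch-effect models.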
Fine spatial scale modelling of Trentino past forest landscape and future change scenarios to study ecosystem services through the years
Ciolli, Marco; Cantiani, Maria Giulia
Landscape in Europe has changed dramatically in recent decades. This has been especially true for Alpine regions, where the progressive urbanization of the valleys has been accompanied by the abandonment of smaller villages and areas at higher elevation. This trend has been clearly observable in the Provincia Autonoma di Trento (PAT) region in the Italian Alps. The impact has been substantial for many rural areas, with the progressive shrinking of meadows and pastures due to natural forest recolonization. These modifications of the landscape affect biodiversity and social and cultural dynamics, including landscape perception and some ecosystem services. A literature review showed that this topic has been addressed by several authors across the Alps, but their research is limited in spatial coverage, spatial resolution and time span. This thesis aims to create a comprehensive dataset of historical maps and multitemporal orthophotos for the PAT area, to analyse these data to identify the changes in forest and open areas and evaluate how these changes affected landscape structure and ecosystems, to create a future change scenario for a test area, and to highlight some major changes in ecosystem services through time.
In this study, a high-resolution dataset of maps covering the whole PAT area for over a century was developed. The earliest representation of the PAT territory containing reliable data about forest coverage was considered to be the Historic Cadastral maps of 1859. These maps systematically and accurately represented the land use of each parcel in the Habsburg Empire, including the PAT. The Italian Kingdom Forest Maps were the next important source of information about forest coverage after World War I, followed by the most recent datasets: the greyscale images of 1954 and 1994 and the multiband images of 2006 and 2015.
The purpose of the dataset development is twofold: on one hand, to create a series of maps describing forest and open-area coverage over the last 160 years for the whole PAT; on the other, to set up and test procedures to extract the relevant information from imagery and historical maps. The datasets were archived, processed and analysed using the Free and Open Source Software (FOSS) GIS packages GRASS and QGIS, together with R.
The goal set by this work was achieved through a remote sensing analysis of these maps and aerial imagery. A series of procedures was applied to extract a land use map, with the forest categories reaching a level of detail rarely achieved for a study area of this extension (6,200 km²). The resolution of the original maps is in fact at the metre level, whereas the coarsest resampling adopted is 10 m × 10 m pixels.
The great variety and size of the input data required the development, alongside the main part of the research, of a series of new tools for automating the analysis of the aerial imagery and reducing user intervention. New tools for historic map classification were developed as well, to eliminate symbols (e.g. cartographic signs) from the resulting land use maps, thus enhancing the results.
Once the multitemporal forest maps were obtained, the second phase of the work was a qualitative and quantitative assessment of the forest coverage and how it changed. This was performed by evaluating a number of landscape metrics, indices used to quantify the compaction or rarefaction of forest areas.
A recurring issue in the literature on landscape metrics was identified and studied extensively in the course of this analysis. It highlighted the importance of specifying certain parameters in the most widely used landscape fragmentation analysis software in order to make the results of different studies properly comparable.
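One concrete instance of such a parameter is pixel connectivity: counting forest patches with 4-neighbour versus 8-neighbour adjacency yields different values of the same metric on the same map. The minimal sketch below (a toy patch counter, not the software used in the thesis) shows the effect.

```python
import numpy as np

def count_patches(mask, diagonal=False):
    """Count connected forest patches; `diagonal` toggles 8-connectivity."""
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if diagonal:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    seen = np.zeros_like(mask, dtype=bool)
    rows, cols = mask.shape
    patches = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r, c] and not seen[r, c]:
                patches += 1                     # new patch found
                stack = [(r, c)]
                seen[r, c] = True
                while stack:                     # flood fill the patch
                    i, j = stack.pop()
                    for di, dj in steps:
                        ni, nj = i + di, j + dj
                        if (0 <= ni < rows and 0 <= nj < cols
                                and mask[ni, nj] and not seen[ni, nj]):
                            seen[ni, nj] = True
                            stack.append((ni, nj))
    return patches

# Two forest patches touching only at a corner: 4-connectivity sees two
# patches, 8-connectivity sees one, on the very same map.
forest = np.array([[1, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]], dtype=bool)
print(count_patches(forest, diagonal=False))  # -> 2
print(count_patches(forest, diagonal=True))   # -> 1
```

Unless such choices are reported, two studies can compute "the number of patches" on identical data and obtain incompatible results, which is exactly the comparability problem raised above.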
Within this analysis, data from other maps were used to characterize the process of afforestation in PAT: the potential forest maps, used to quantify the area of potential forest actually afforested over the years; the Digital Elevation Model, used to quantify the changes in forest area at different altitude ranges; and, finally, the forest class map, used to estimate how afforestation has affected each single forest type.
The output forest maps were used to analyse and estimate some ecosystem services, in particular protection from soil erosion, changes in biodiversity and the landscape of the forests.
Finally, a procedure for the analysis of future change scenarios was set up to study how afforestation will proceed, in the absence of external factors, in a protected area of PAT. The procedure was developed using Agent Based Models (ABM), which treat trees as decision-making agents able to choose where to expand the forest area.
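As a rough illustration of this style of model (the actual thesis model is not reproduced here, and all parameters below are invented), a forest-expansion step can be sketched as a stochastic cellular automaton in which each forest cell may colonize a random neighbouring cell:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy parameters: a 30x30 grid, 20 yearly steps, and a 30% chance
# per step that a forest cell attempts to spread to a random neighbour.
size, steps, p_spread = 30, 20, 0.3
grid = np.zeros((size, size), dtype=bool)
grid[size // 2, size // 2] = True          # a single forest seed

for _ in range(steps):
    new = grid.copy()
    fr, fc = np.nonzero(grid)
    for r, c in zip(fr, fc):               # each forest cell is an "agent"
        if rng.random() < p_spread:
            dr, dc = rng.choice([-1, 0, 1]), rng.choice([-1, 0, 1])
            nr, nc = r + dr, c + dc
            if 0 <= nr < size and 0 <= nc < size:
                new[nr, nc] = True         # colonize the neighbouring cell
    grid = new

print(f"forest cells after {steps} steps: {int(grid.sum())}")
```

Real ABM platforms let each agent weigh elevation, soil and neighbouring land use when choosing where to expand; the toy above only conveys the mechanism of local, agent-driven growth.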
The first part of the results consists of a temporal series of maps representing the state of the forest in each year of the considered dataset. The analysis of these maps suggests a trend of afforestation across the PAT territory. The forest maps were then reclassified by altitude range and forest type to show how the afforestation proceeded at different altitudes and for different forest types. The results showed that forest expansion acted homogeneously across altitude ranges and forest types. The analysis of a selected set of landscape metrics showed a progressive compaction of the forests at the expense of the open areas, in each altitude range and for each forest type. On one hand this benefits all those ecosystem services linked to high forest cover; on the other it reduces ecotonal habitats and affects biodiversity distribution and quality. Finally, the ABM procedure produced a set of maps representing a possible evolution of the forest in an area of PAT, depicting a situation similar to that of other simulations developed with different models in the same area. A second part of the results consists of new open source tools for image analysis, developed to achieve the results shown here but with a potentially wider field of application, along with new procedures for the evaluation of image classification. The work fulfilled its aims, in the meantime providing new tools and enhancements of existing tools for remote sensing, and leaving as a legacy a large dataset that will be used to deepen the knowledge of the PAT territory and, more widely, to study emerging patterns of afforestation in an alpine environment.
Gobbi, S.
Parallel Computing for Biological Data
In the 1990s a number of technological innovations appeared that revolutionized biology, and 'Bioinformatics' became a new scientific discipline. Microarrays can measure the abundance of tens of thousands of mRNA species, data on the complete genomic sequences of many different organisms are available, and other technologies make it possible to study various processes at the molecular level. In Bioinformatics and Biostatistics, current research and computations are limited by the available computer hardware. However, this problem can be addressed using high-performance computing resources. There are several reasons for the increased focus on high-performance computing: larger data sets, increased computational requirements stemming from more sophisticated methodologies, and recent developments in computer chip production.
The open-source programming language 'R' was developed to provide a powerful and extensible environment for statistical and graphical techniques. There are many good reasons for preferring R to other software or programming languages for scientific computations (in statistics and biology). However, the development of the R language was not aimed at providing software for parallel or high-performance computing. Nonetheless, during the last decade, a great deal of research has been conducted on using parallel computing techniques with R.
This PhD thesis demonstrates the usefulness of the R language and parallel computing for biological research. It introduces parallel computing with R, and reviews and evaluates existing techniques and R packages for parallel computing on Computer Clusters, on Multi-Core Systems, and in Grid Computing. From a computer-scientific point of view the packages were examined as to their reusability in biological applications, and some upgrades were proposed.
Furthermore, parallel applications for next-generation sequencing data and for the preprocessing of microarray data were developed. Microarray data are characterized by high levels of noise and bias. As these perturbations have to be removed, preprocessing of raw data has been a high-priority research topic over the past few years. A new Bioconductor package called affyPara for parallelized preprocessing of high-density oligonucleotide microarray data was developed and published. The data can be partitioned across arrays using a block cyclic partition, and, as a result, parallelization of algorithms becomes directly possible. Existing statistical algorithms and data structures had to be adjusted and reformulated for use in parallel computing. Using the new parallel infrastructure, normalization methods can be enhanced and new methods become available. The partition of the data and its distribution to several nodes or processors solves the main memory problem and accelerates the methods by up to a factor of fifteen for 300 arrays or more.
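A minimal sketch of the block-wise idea, using Python threads as a stand-in for the R/affyPara infrastructure: arrays are assigned to workers in a block cyclic fashion and each block is preprocessed independently. The per-array median-centering used here is a toy preprocessing step, not affyPara's actual algorithms.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(3)

# Hypothetical stand-in: 12 "arrays" with 10_000 probes each, each array
# with its own intensity bias (as raw microarray data typically have).
arrays = rng.lognormal(mean=rng.normal(0, 0.5, 12)[:, None], sigma=1.0,
                       size=(12, 10_000))

def preprocess_block(block):
    """Median-centre each array of the block on the log scale."""
    logged = np.log2(block)
    return logged - np.median(logged, axis=1, keepdims=True)

# Block cyclic partition across 4 workers: worker k gets arrays k, k+4, ...
n_workers = 4
blocks = [arrays[k::n_workers] for k in range(n_workers)]
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    results = list(pool.map(preprocess_block, blocks))

# Reassemble the results in the original array order.
out = np.empty_like(arrays)
for k, res in enumerate(results):
    out[k::n_workers] = res

print(np.allclose(np.median(out, axis=1), 0.0))  # every array centred
```

Because each array is preprocessed independently here, the partition is embarrassingly parallel; the harder part addressed in the thesis is reformulating methods such as quantile normalization, which need information across all arrays, so that they still work on distributed blocks.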
The final part of the thesis contains a large cancer study analysing more than 7,000 microarrays from a publicly available database and estimating gene interaction networks. For this purpose, a new R package for microarray data management was developed, and various challenges regarding the analysis of this amount of data are discussed. The comparison of gene networks for different pathways and different cancer entities in this large data collection partly confirms already established forms of gene interaction.
Zero-Sum Regression: Scale-Invariant Molecular Data Analysis
In biomedicine, it is still an outstanding issue that the absolute scale of omics data is lost due to technical limitations. As a consequence, the original scale first has to be approximated by normalization techniques before analysis methods can be applied. However, there are competing normalization strategies based on different assumptions about the structure of the underlying data. Because of these differing assumptions, normalization methods can yield different results, which can in turn affect the outcome of downstream analysis methods. An alternative concept is therefore to resolve this issue by using scale-invariant data analysis methods.
This thesis shows how generalized linear regression methods can be made scale invariant for log-transformed omics data by enforcing an additional constraint called zero-sum. To this end, an efficient coordinate descent algorithm is developed, and the advantages of the approach are demonstrated in simulations and on omics data. The corresponding open source software zeroSum is available at https://github.com/rehbergT/zeroSum.
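A minimal numerical sketch of the zero-sum idea follows. The actual package implements penalized regression solved by coordinate descent; here, as a simplified assumption, the constraint is enforced by reparameterization and plain least squares, which is enough to show why the constraint makes predictions invariant to sample-wise rescaling.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic log-transformed omics features and a response generated from a
# coefficient vector that already sums to zero.
n, p = 50, 6
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -1.0, 0.5, -0.5, 0.0, 0.0])   # sums to zero
y = X @ beta_true + rng.normal(0, 0.1, n)

# Basis of the zero-sum subspace {b : sum(b) = 0}: columns e_i - e_{i+1}.
N = np.zeros((p, p - 1))
for i in range(p - 1):
    N[i, i], N[i + 1, i] = 1.0, -1.0

# Least squares over beta = N @ gamma automatically satisfies sum(beta)=0.
gamma, *_ = np.linalg.lstsq(X @ N, y, rcond=None)
beta = N @ gamma

# Scale invariance: multiplying a sample's raw intensities by a factor adds
# a constant c_i to every log-feature of that sample. With sum(beta)=0,
# (X + c) @ beta = X @ beta + c * sum(beta) = X @ beta, so any sample-wise
# normalization shift drops out of the prediction.
c = rng.normal(0, 3, size=(n, 1))       # arbitrary per-sample log-shifts
print(np.allclose((X + c) @ beta, X @ beta))  # -> True
```

This is the sense in which zero-sum regression sidesteps the choice of normalization method: the fitted model gives the same predictions whichever per-sample scaling the data arrived with.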