biclustermd: An R Package for Biclustering with Missing Values
Biclustering is a statistical learning technique that attempts to find homogeneous partitions of the rows and columns of a data matrix. For example, movie ratings might be biclustered to group both raters and movies. biclust is an existing R package that allows users to implement a variety of biclustering algorithms. However, its algorithms do not allow the data matrix to have missing values. We provide a new R package, biclustermd, which allows users to perform biclustering on numeric data even in the presence of missing values.
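The core idea of biclustering under missing data can be sketched without the package itself. The toy Python function below (a hypothetical illustration, not the biclustermd API) alternately reassigns rows and columns to cluster prototypes, computing bicluster means and assignment costs over observed cells only, so NaN entries never block the partitioning:

```python
import numpy as np

def bicluster_md(A, n_row, n_col, n_iter=20, seed=0):
    """Toy biclustering with missing values (NaN): alternately reassign
    rows and columns to the partition that minimizes squared error over
    *observed* cells only. Not the biclustermd algorithm itself."""
    rng = np.random.default_rng(seed)
    r = rng.integers(n_row, size=A.shape[0])  # row-cluster labels
    c = rng.integers(n_col, size=A.shape[1])  # column-cluster labels
    for _ in range(n_iter):
        # bicluster means over observed entries only
        M = np.full((n_row, n_col), np.nan)
        for i in range(n_row):
            for j in range(n_col):
                block = A[np.ix_(r == i, c == j)]
                if np.any(~np.isnan(block)):
                    M[i, j] = np.nanmean(block)
        # empty or all-missing blocks fall back to the grand mean
        M = np.where(np.isnan(M), np.nanmean(A), M)
        # reassign rows, then columns, by squared error on observed cells
        for axis, labels, k in ((0, r, n_row), (1, c, n_col)):
            other = c if axis == 0 else r
            for idx in range(A.shape[axis]):
                v = A[idx] if axis == 0 else A[:, idx]
                costs = []
                for g in range(k):
                    proto = M[g, other] if axis == 0 else M[other, g]
                    costs.append(np.nansum((v - proto) ** 2))
                labels[idx] = int(np.argmin(costs))
    return r, c
```

Since NaN cells contribute nothing to the assignment cost, a row can be clustered from whatever subset of its entries was observed.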
Optimizing the design of planting experiments for agricultural crops
In commercial breeding, new genotypes are constantly being created and need to be tested to understand how a specific seed will perform in its target locations. A major constraint is that a genotype must go through multiple years of testing before it can be commercialized. Given the volume of new genotypes that are constantly being created, it is unrealistic to test every genotype in every target environment. Here, a methodology has been created that accounts for the fact that resources are limited, whether limited space or a limited number of each genotype available in a single planting season. This new approach works by using the observations of genotypes that were planted to infer the performance of specific genotypes in certain environments. For agricultural crops, not all genotypes respond in the same way when planted in a certain environment. This phenomenon is known as genotype-by-environment (GxE) interaction. Numerous methods exist that aim to predict plant performance and specifically to quantify and understand the GxE interaction. Here, five models are first evaluated on four different crop datasets. The Biclustering model is one model considered, and it is effective at determining which genotypes have no GxE interaction in a subset of environments. This model works well with sparse data, which is what exists in practice. Therefore, the Biclustering model is used to find subsets of genotypes and environments that have little to no GxE interaction.
In a subset of genotypes and environments with no interaction, genotypes can be planted in a strategic, methodical pattern so that the phenotypes of unplanted genotypes can be inferred. Depending on the amount of physical resources available, two approaches can be utilized to gain information about unplanted genotypes. Given a set number of genotypes that can be planted, the first approach aims to maximize the number of known genotype/environment pairs. The term genotype/environment pair refers to the phenotype that exists for a single genotype in a single environment. The second approach determines how many observations are required to infer every genotype/environment pair within a dataset. Additional constraints can be introduced to create a more realistic model.
The effectiveness of these two approaches can be illustrated using small-scale experimental designs that translate to full-scale commercial cases. To evaluate the experimental designs created, both optimized and random models are compared to the original phenotypic responses. Validation indicates that optimizing the location of genotypes allows more inferences to be made, implying that an optimized planting plan can improve the understanding of genotypes. If this approach is applied in practice, it can facilitate further research, as additional information can be gained from existing resources.
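The inference step described above rests on a simple observation: within a subset with no GxE interaction, the phenotype is approximately additive, y(g, e) ≈ μ + G(g) + E(e), so a sparse but connected set of planted pairs pins down every other pair. The sketch below (a hypothetical illustration under that additivity assumption, not the thesis's optimization model) fits the additive model to observed pairs by least squares and predicts unplanted ones:

```python
import numpy as np

def infer_unplanted(obs):
    """obs: dict {(genotype, environment): phenotype} for planted pairs.
    Fits the additive no-interaction model y ~ mu + G[g] + E[e] and
    returns a predictor for any (g, e) pair seen in the design."""
    genos = sorted({g for g, _ in obs})
    envs = sorted({e for _, e in obs})
    gi = {g: i for i, g in enumerate(genos)}
    ei = {e: i for i, e in enumerate(envs)}
    # design matrix: intercept + genotype dummies + environment dummies
    X = np.zeros((len(obs), 1 + len(genos) + len(envs)))
    y = np.zeros(len(obs))
    for row, ((g, e), val) in enumerate(obs.items()):
        X[row, 0] = 1.0
        X[row, 1 + gi[g]] = 1.0
        X[row, 1 + len(genos) + ei[e]] = 1.0
        y[row] = val
    # minimum-norm least squares handles the rank-deficient dummies
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)

    def predict(g, e):
        return beta[0] + beta[1 + gi[g]] + beta[1 + len(genos) + ei[e]]

    return predict
```

With three genotypes and two environments, planting all genotypes in one environment plus a single genotype in the other (4 of 6 pairs) already determines the remaining two, which is the kind of saving the optimized designs exploit at scale.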
Spatial and temporal patterns in Holocene wildfire responses to environmental change in the northern extratropics
Fire is an important environmental process in the northern extratropics (NET), with several NET regions predicted to experience the highest-magnitude increases in fire activity of any global region in the future. Previous NET palaeofire studies are limited by poor data availability and a lack of quantitative methods. A synthesis of charcoal records is conducted to reconstruct sub-continental-scale Holocene fire histories across the NET (>45°N) and to understand their environmental controls. A circum-NET-scale analysis and a more spatially resolved analysis at the European scale (n = 21 regions) are conducted. At the NET scale, simulated palaeoclimate and plant productivity data are used in a novel clustering method to define a stratification that delineates spatial units of coherent fire-relevant environmental change. At the European scale, this is done using pollen-based reconstructions of Holocene forest cover, summer temperature, and precipitation change. Fire histories are reconstructed by aggregating charcoal records from the Reading Palaeofire Database within clusters. Fire reconstructions are correlated with climate and land-cover reconstructions at 4000-year intervals. Fire responses in 20 regions show correlation values of ≥ |0.75| with at least one environmental variable for at least one 4000-year interval. Across Europe, fire increased over the Holocene, initially in response to the collapse of the Fennoscandian Ice Sheet and the associated climate drying and forestation. Mid-to-late-Holocene fire increases were caused by forest compositional shifts, human deforestation, and agricultural expansion. Across North America, the early-Holocene collapse of the Laurentide Ice Sheet caused continent-wide productivity increases, leading to fire increases. A subsequent long-term moisture increase drove late-Holocene fire declines across most of the continent. In central Asia, a general Holocene-wide moisture increase drove a long-term fire decline. The results support previous studies showing that sub-continental palaeofire histories in the NET are explained by variations in climate variables influencing fuel moisture and load, but that these effects can be modulated by land-cover processes influencing fuel structure and composition. The results provide a basis for spatial prediction of fire-regime changes in response to future climate, vegetation, and human land-use processes.
Improvement of data analysis methods with application to biomedical data
Doctoral Program in Biotechnology, Engineering and Chemical Technology. Research Line: Engineering, Data Science and Bioinformatics. Program Code: DBI. Line Code: 111.
Today, the volume of data is growing rapidly in a multitude of scientific fields, such as the biomedical field. With the continuous increase in database sizes, many traditional approaches to the analysis of biological and biomedical data face the major challenge of analyzing this large amount of data within a reasonable time. For this reason, there is a clear need to develop new computational methods that can cope with the volume, variety, velocity, and veracity that characterize these types of data. Machine learning techniques, and more specifically Biclustering techniques, have become an essential tool for the analysis of this type of data in any kind of study.
The new characteristics that define the aforementioned data types, as well as incorrect decisions when managing hardware and software computational resources, mean that Biclustering techniques are still not efficient, despite the great advances made in recent years to accelerate their computational performance. Moreover, the larger the volume of data, the larger the number of possible solutions. Consequently, from the end user's perspective, analyzing or validating an enormous number of biological solutions becomes extremely challenging.
This thesis presents three main contributions, named biGO, gBiBit, and gMSR. biGO is a web-based gene enrichment analysis tool that extracts and improves useful biological knowledge from an input set of biclusters. One of these improvements is that, through a visual analysis in the form of an interactive graph, functional connections can be determined not only among the biological terms of a single bicluster but also across the multiple biclusters involved in the experiment.
The second contribution, gBiBit, is a Biclustering algorithm designed to make full use of the computational resources offered by a cluster of GPU devices. GPU devices offer a substantial improvement in computational performance but, owing to their technology, do not guarantee the ability to process large datasets. The algorithm presented in this thesis follows a methodology that not only delivers results in a reasonable time but is also capable of processing large datasets, overcoming the device limitations that do appear in other works.
gMSR is a version of the MSR proximity measure that uses a cluster of GPU devices to accelerate the computational performance of the original measure and to validate the quality of an enormous number of biclusters. To the best of our knowledge, this technology had not previously been applied to any bicluster validation technique.
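For reference, the measure that gMSR accelerates is the Mean Squared Residue (MSR) of Cheng and Church, which scores how well a bicluster fits an additive pattern (lower is more coherent). A minimal CPU sketch in Python, not the GPU implementation described in the thesis:

```python
import numpy as np

def msr(A, rows, cols):
    """Mean Squared Residue of the bicluster A[rows, cols]:
    mean of (a_ij - a_iJ - a_Ij + a_IJ)^2, where a_iJ and a_Ij are
    row and column means and a_IJ is the overall bicluster mean."""
    B = A[np.ix_(rows, cols)]
    row_means = B.mean(axis=1, keepdims=True)   # a_iJ
    col_means = B.mean(axis=0, keepdims=True)   # a_Ij
    overall = B.mean()                          # a_IJ
    residue = B - row_means - col_means + overall
    return float((residue ** 2).mean())
```

A perfectly additive bicluster scores exactly zero, which is why validating millions of candidate biclusters reduces to evaluating this quantity at scale, the workload gMSR moves onto GPUs.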
Thanks to the proposed contributions, this doctoral thesis provides the scientific community with a better understanding of how computational methods must be adapted to generate their results within a reasonable time from large biomedical datasets. Furthermore, there are high-performance computing (HPC) technologies, such as GPU devices, that until now had been used only to accelerate the computational performance of these computational methods. This doctoral thesis demonstrates how GPU devices can equally be used to enable computational methods to process these large biomedical datasets.
Universidad Pablo de Olavide de Sevilla. Departamento de Deporte e Informática.