CLIM4OMICS: a geospatially comprehensive climate and multi-OMICS database for maize phenotype predictability in the United States and Canada
The performance of numerical, statistical, and data-driven diagnostic and predictive crop production modeling relies heavily on data quality for input and calibration or validation processes. This
study presents a comprehensive database and the analytics used to
consolidate it as a homogeneous, consistent, multidimensional genotypic, phenotypic, and environmental database for maize phenotype modeling, diagnostics, and prediction. The data are obtained from the Genomes to Fields (G2F) initiative, which provides multiyear genomic (G), environmental (E), and phenotypic (P) datasets that can be used to train and test crop growth models to understand the genotype by environment (GxE)
interaction phenomenon. A particular advantage of the G2F database is its
diverse set of maize genotype DNA sequences (G2F-G), phenotypic measurements (G2F-P), station-based environmental time-series observations (mainly climatic data) collected during the maize-growing season (G2F-E), and metadata for each field trial (G2F-M) across the United States (US), the province of Ontario in Canada, and the state of Lower Saxony in Germany. The construction
of this comprehensive climate and genomic database incorporates the
analytics for data quality control (QC) and consistency control (CC) to
consolidate the digital representation of geospatially distributed
environmental and genomic data required for phenotype predictive analytics
and modeling of the GxE interaction. The two-phase QC–CC preprocessing
algorithm also includes a module to estimate environmental uncertainties.
Generally, this data pipeline collects raw files, checks their formats,
corrects data structures, and identifies and then repairs or imputes missing data.
This pipeline uses machine-learning techniques to fill the environmental
time series gaps, quantifies the uncertainty introduced by using other
data sources for gap imputation in G2F-E, discards the missing values in
G2F-P, and removes rare variants in G2F-G. Finally, an integrated and
enhanced multidimensional database was generated. Both the analytics for
improving the G2F database and the improved database, called Climate for OMICS (CLIM4OMICS), follow the findability, accessibility, interoperability, and reusability (FAIR) principles. All data and code are available at
https://doi.org/10.5281/zenodo.8002909 (Aslam et al., 2023a) and https://doi.org/10.5281/zenodo.8161662 (Aslam et al., 2023b), respectively.
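
The three cleaning steps named in the abstract (discarding missing G2F-P values, removing rare G2F-G variants, and filling G2F-E gaps) can be illustrated with a minimal sketch. This is not the authors' code. The function names, the 0/1/2 genotype coding, and the 0.05 minor-allele-frequency threshold are all assumptions made for illustration. Linear interpolation stands in for the machine-learning gap filling the abstract describes.

```python
from typing import Dict, List, Optional


def drop_missing_phenotypes(records: List[dict], trait: str) -> List[dict]:
    """Keep only phenotype records that have an observed value for `trait`."""
    return [r for r in records if r.get(trait) is not None]


def minor_allele_frequency(calls: List[Optional[int]]) -> float:
    """MAF of one biallelic marker coded as 0/1/2 alternate-allele counts.

    The 0/1/2 coding is an assumption; None marks a missing genotype call.
    """
    called = [c for c in calls if c is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def remove_rare_variants(
    markers: Dict[str, List[Optional[int]]],
    maf_threshold: float = 0.05,  # assumed threshold, not taken from the paper
) -> Dict[str, List[Optional[int]]]:
    """Drop markers whose minor allele frequency is below the threshold."""
    return {
        name: calls
        for name, calls in markers.items()
        if minor_allele_frequency(calls) >= maf_threshold
    }


def fill_gaps_linear(series: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior gaps by linear interpolation between known neighbours.

    A simple stand-in for ML-based imputation. Leading and trailing gaps
    have no neighbour on one side and are left as None.
    """
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            filled[i] = filled[left] + weight * (filled[right] - filled[left])
    return filled


if __name__ == "__main__":
    print(drop_missing_phenotypes([{"yield": 9.1}, {"yield": None}], "yield"))
    print(remove_rare_variants({"m1": [0, 0, 0, 0], "m2": [0, 1, 2, 1]}))
    print(fill_gaps_linear([20.0, None, 24.0, None]))
```

Keeping each step as a separate pure function mirrors the modular pipeline described above and lets a different imputation method be swapped in without touching the other steps.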