This dissertation is composed of three research projects focused on developing statistical methodologies, theories, and algorithms for solving large-scale real-world problems with complex features.
The first project in chapter 2 deals with massive datasets on corn phenotypes and genotypes. We propose a novel hierarchical spatial Finlay-Wilkinson model for analyzing yield data and characterizing genotype-by-environment interactions from multi-environment field trials. The key ingredients in our hierarchical framework are (1) kinship information on the relatedness among genotypes from DNA sequence data, (2) spatial correlation among plot effects within fields, and (3) environmental covariates obtained from weather stations. Together, these ingredients enhance estimation of genotypic and environmental effects and reduce bias in estimating the adaptability of the genotypes. With practical application in mind, we develop a fast MCMC algorithm for sampling from the posterior distribution. Using publicly available data from the Genomes to Fields initiative, we demonstrate that our method improves yield prediction over existing methods and permits yield predictions for new genotypes in new environments.
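For orientation, the classical Finlay-Wilkinson regression that the hierarchical model builds on is commonly written as below. The symbols are illustrative assumptions and may differ from the notation used in chapter 2; the comments on how the three ingredients could enter are one plausible reading, not a statement of the model actually fitted.

```latex
% Classical Finlay-Wilkinson regression (notation assumed for illustration):
%   y_{ij}           yield of genotype i in environment j
%   g_i              main effect of genotype i
%   h_j              main effect of environment j
%   b_i              adaptability (slope) of genotype i with respect to h_j
%   \varepsilon_{ij} residual plot-level error
\begin{equation}
  y_{ij} = \mu + g_i + (1 + b_i)\, h_j + \varepsilon_{ij}
\end{equation}
% One plausible way the three ingredients could enter (assumption):
%   kinship:        (g_1, \dots, g_n)^\top \sim N(0, \sigma_g^2 K), with K a kinship matrix
%   covariates:     h_j modeled as a regression on weather covariates
%   spatial effect: \varepsilon_{ij} split into a spatially correlated plot effect plus noise
```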
The second project in chapter 3 involves analyzing high-dimensional functional data. We explore functional linear regression in the large-scale scenario in which a scalar response is associated with a potentially ultra-large number of functional predictors, within a reproducing kernel Hilbert space (RKHS) framework. We propose a functional elastic-net model and introduce the Karush-Kuhn-Tucker (KKT) conditions in function spaces. Using the functional KKT conditions, we show that a unique solution of the functional elastic-net exists. We provide sufficient conditions for establishing variable selection consistency and prediction consistency. A computational algorithm is also developed to solve the functional elastic-net problem efficiently. The performance of the proposed method is evaluated by simulation studies in various high-dimensional settings.
The third project in chapter 4 deals with image-based high-throughput phenotyping data. Specifically, the goal is to extract plant heights from an image-based field phenotyping system. We describe a self-supervised pipeline (KAT4IA) that uses K-means clustering on greenhouse images to construct training data for segmenting plants in images of field-grown plants, followed by automatic separation of target plants, calculation of plant heights, and functional curve fitting of the extracted heights. This approach is efficient and does not require human intervention.
Our results show that KAT4IA accurately estimates plant heights during the growth stages in which the plants in the first row do not overlap with plants in the background.
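The clustering idea behind such a pipeline can be illustrated with a minimal sketch. It is not the KAT4IA implementation: it only shows K-means applied to pixel colors, followed by a naive height calculation in pixels. All function names, the deterministic farthest-first initialization, the excess-green rule for picking the plant cluster, and the toy image are assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two RGB tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100):
    """Lloyd's algorithm with deterministic farthest-first initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        # Assignment step: each pixel goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


def plant_mask(image, k=2):
    """Cluster pixel colors and mark the greenest cluster as plant."""
    pixels = [px for row in image for px in row]
    labels, centers = kmeans(pixels, k)
    # Excess-green index 2G - R - B (an assumed heuristic for the plant cluster).
    plant = max(range(k), key=lambda c: 2 * centers[c][1] - centers[c][0] - centers[c][2])
    width = len(image[0])
    return [[labels[r * width + c] == plant for c in range(width)]
            for r in range(len(image))]


def plant_height_pixels(mask):
    """Height in pixels: vertical extent of the rows containing plant pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    return rows[-1] - rows[0] + 1 if rows else 0


# Toy 5x4 image: soil-colored background with a small green plant.
SOIL, SOIL2 = (120, 90, 60), (110, 95, 65)
LEAF, LEAF2 = (40, 160, 50), (50, 150, 60)
toy_image = [
    [SOIL, SOIL2, SOIL, SOIL],
    [SOIL, LEAF, SOIL2, SOIL],
    [SOIL2, LEAF2, LEAF, SOIL],
    [SOIL, LEAF, SOIL, SOIL2],
    [SOIL, SOIL, SOIL2, SOIL],
]
toy_mask = plant_mask(toy_image)
toy_height = plant_height_pixels(toy_mask)  # 3 for this toy image
```

In the pipeline described above, the clustering output serves as training data for a segmentation model rather than as the final segmentation; the sketch skips that step and the separation of target plants.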
In chapter 5, we describe a sequential convolutional neural network (CNN) pipeline that uses plant images from early growth stages to construct training data for separating foreground and background plant pixels at late stages of plant growth. This pipeline, together with KAT4IA, provides accurate plant height estimates throughout the entire plant growth period.