Improving Genomic Prediction with Synthetic Traits
Genomic prediction has been one of the essential tools for modern breeding programs, especially for traits that are expensive, labor-intensive, or low-throughput to phenotype. However, phenotypic information is still needed to train prediction models, and the quality of these predictions is bounded by the quality of the phenotypic data set (i.e., heritability). One option to improve prediction accuracy in low heritability traits is to incorporate traits with higher heritabilities into a multivariate prediction model. When the correlated trait is assessed with higher throughput than the trait of interest, using multi-trait models becomes very advantageous. Therefore, we hypothesized that incorporating high-heritable synthetic traits (ratios of leaf reflectance spectra) in multi-trait genomic prediction models could help predict traits of low-throughput phenotyping. We conducted an experiment where we phenotyped a biomass sorghum diversity panel of 836 lines in two locations in Illinois. We measured specific leaf area (SLA), leaf nitrogen content (N), and leaf-level reflectance (350-2500 nm). Additionally, we used previously developed partial least squares regression (PLSR) models on the reflectance data to create two new traits, predicted SLA (PLSR_SLA) and predicted N (PLSR_N). To create the synthetic traits, we obtained all possible pairwise ratio wavelengths and calculated their co-heritability with each of the four traits of interest (SLA, N, PLSR_SLA, and PLSR_N). We generated a heatmap in each case and highlighted the top 1% highest co-heritability. Different regions were highlighted in the heatmap, and we selected the wavelength ratio with the highest co-heritability from each area. To further reduce the number of synthetic traits, we used hierarchical clustering based on Euclidean distance, which allowed us to identify three subgroups for N and PLSR_N and 2 for SLA and PLSR_SLA. The wavelength ratio with the highest co-heritability was used for each subset as a synthetic trait. We evaluated three genomic prediction scenarios i) univariate biological traits (N and SLA), ii) univariate predicted traits (PLSR_N and PLSR_SLA), iii) Multi-trait GS of SLA, N, PLSR_SLA, or PLSR_N along with the synthetic traits. We utilized a five-fold cross-validation for single and multi-trait models. For the latter, we evaluated a CV1 scheme, where 80% of the target trait and 80% of the synthetic trait were used to train the prediction model, and a CV2, where 80% of the target trait and 100% of the synthetic trait were used to train the prediction model. To reduce bias, we used individuals from one location as the training set and the other location as validation. Prediction accuracies ranged from 0.41 for N to 0.66 in PLSR_N combined with synthetic traits. The combination of PLSR_N or PLSR_SLA predictions with synthetic traits resulted in higher prediction accuracy than the single model for either N or SLA. As hypothesized, the use of synthetic traits in multivariate models increased prediction accuracy and could be a good option for traits of low heritability.