Using Machine-Learning Models for Field-Scale Crop Yield and Condition Modeling in Argentina

Accurately determining crop growth progress and crop yields at field scale can help farmers estimate their net profit, enable insurance companies to ascertain payouts, and help ensure food security. At field scale, management, soil and weather combine to affect crop growth progress, and this progress can be monitored in-season using satellite data. Here, we use satellite-derived metrics from both optical and radar satellites, together with machine learning models, to model field-scale crop yields for over 3,000 Soybean and Wheat fields in Argentina. We compare several machine learning models, and our results show the promise of combining mixed effect models with nonparametric models to improve yield modeling. We also demonstrate the utility of specific satellite-derived metrics and extracted features in improving model performance, and show that our approach can explain roughly 70% of the variation in yields (R-squared of 0.71 for Soybean and 0.69 for Wheat) while remaining generalizable across crops and agro-ecological zones.


Introduction
Accurate and timely crop yield estimates are critical in ensuring food security [1]. Approaches to estimate field-scale crop yields include expert scouting estimates (considering weather and historical yield), process-based models and crop cut information. While these approaches are less expensive than conducting full field harvests, they cannot be scaled across multiple geographies at low cost. In this context, satellite data can play a key role by providing a low-cost, crop-agnostic, spatially and temporally scalable way to monitor large areas of agricultural land globally [2]. Satellite-based crop vigor indicators, like the Normalized Difference Vegetation Index (NDVI), provide a timely, repetitive and synoptic indicator of the impact of growing conditions on potential yields [3], and various studies have demonstrated the utility of NDVI and similar indicators in capturing yield variations at field scale. The biophysical basis of this approach is that the factors affecting crop growth and yield first affect the photosynthetically active biomass of a plant, and that effect can be captured through indicators like NDVI.
However, the performance of these NDVI-based approaches, even in the generally homogeneous fields of the U.S., is usually low, with R-squared values around 0.5 [4]. Potential reasons include poor selection of empirical models and saturation of NDVI over highly productive crop fields.
In this work, we present a novel approach to estimate field-scale crop yields for Soybean and Wheat in Argentina based on a collection of more than 3,000 yield estimates. In 2019/2020, Argentina grew more than 17 million hectares of soybeans with a total production of 50 million tons; it is the largest exporter of soybean oil and meal and the third largest soybean producer in the world.

Methods
Our yield forecasting approach is based on a training dataset of 2,020 Soybean fields and 928 Wheat fields with reported yields from the 2017 to 2019 growing seasons. We used satellite data from Sentinel-1, Sentinel-2 and Landsat-8, and computed NDVI and the Green Chlorophyll Vegetation Index (GCVI) from the optical data and VH, VV and VH/VV from the radar data. For each of the time series, we computed multiple time-series characteristics, or features [5]. These features capture various properties of the satellite signal over the season that could relate to crop growth and yields. We implemented four machine-learning algorithms: Lasso regression, Generalized Additive Models (GAM), Random Forests and Mixed Effect Random Forests (MERF, [6]). In MERF, we used a gradient boosting algorithm called CatBoost [7], and used one of two regional classifications, administrative boundaries or zone designation, to determine the random effects. In all, we tested 4 machine-learning models and conducted 8 different experiments with different choices of features, for 9 different satellite data sources and 2 crops (Soybean and Wheat), for a total of 576 (2 x 4 x 8 x 9) unique combinations.
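As a minimal sketch of the index and feature computations described above, the following Python uses the standard definitions NDVI = (NIR - Red) / (NIR + Red) and GCVI = NIR / Green - 1. The VH/VV ratio is written assuming backscatter in linear power units, which the text does not specify. The function summary_features and the four features it returns are hypothetical illustrations, not the feature set of [5], and all numeric values are made up.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)


def gcvi(nir, green):
    """Green Chlorophyll Vegetation Index: NIR / Green - 1."""
    return nir / green - 1.0


def vh_vv_ratio(vh, vv):
    """Cross-polarization ratio; assumes linear power units, not decibels."""
    return vh / vv


def summary_features(days, values):
    """Illustrative time-series features (hypothetical, not the set in [5]).

    days   -- observation days, in increasing order
    values -- index value (e.g. NDVI) observed on each day
    """
    peak = max(values)
    # Trapezoidal area under the curve as a proxy for season-long greenness.
    area = sum(
        (d1 - d0) * (v0 + v1) / 2.0
        for d0, d1, v0, v1 in zip(days, days[1:], values, values[1:])
    )
    return {
        "peak": peak,
        "day_of_peak": days[values.index(peak)],
        "mean": sum(values) / len(values),
        "area": area,
    }


# Made-up reflectances for one field over four acquisition dates.
days = [0, 10, 20, 30]
nir = [0.30, 0.40, 0.45, 0.42]
red = [0.12, 0.08, 0.05, 0.07]
ndvi_series = [ndvi(n, r) for n, r in zip(nir, red)]
print(summary_features(days, ndvi_series))
```

Each field then contributes one row of such features per index and sensor, which is the form the regression models expect.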
We used the entire dataset and conducted a 10-fold cross-validation to determine the spread of R-squared and RMSE values that results from using different subsets of the data for training and testing.
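The cross-validation procedure can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used here: a single-predictor least-squares line stands in for the four models, the data are synthetic, and the helper names (k_fold_indices, fit_line, r2_rmse, cross_validate) are hypothetical.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r2_rmse(y_true, y_pred):
    """Coefficient of determination and root-mean-square error."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / len(y_true))


def cross_validate(x, y, k=10, seed=0):
    """Return one (R-squared, RMSE) pair per held-out fold."""
    scores = []
    for held_out in k_fold_indices(len(x), k, seed):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        preds = [slope * x[i] + intercept for i in held_out]
        scores.append(r2_rmse([y[i] for i in held_out], preds))
    return scores


# Synthetic data: a made-up feature and a noisy yield in kg/ha.
rng = random.Random(42)
x = [rng.uniform(1.0, 6.0) for _ in range(200)]
y = [600.0 * v + 500.0 + rng.gauss(0.0, 300.0) for v in x]
for r2, rmse in cross_validate(x, y):
    print(f"R-squared = {r2:.2f}, RMSE = {rmse:.0f} kg/ha")
```

The spread of the ten printed pairs is the quantity of interest: a narrow spread suggests the scores do not depend strongly on which fields were held out.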

Results
Based on the model performance from the 576 experiments, we found that the MERF model performs best for both Soybean and Wheat, with much higher R-squared and lower RMSE values than the other models. For Soybean, MERF using GCVI derived from Sentinel-2 is the best model, resulting in an RMSE of 610 kg/ha and an R-squared of 0.71 (Figures 1, 3). The best performing Wheat model applies MERF to the NDVI time series computed from Sentinel-2, resulting in an RMSE of 696 kg/ha and an R-squared of 0.69 (Figures 2, 3). Random Forests and GAMs produce similar results for both crops, and the latter offer much greater interpretability.
Our results indicate that the two optical satellites (Sentinel-2 and Landsat-8) perform similarly with the MERF model for both Soybean and Wheat. For the other models, however, Sentinel-2 always outperforms Landsat-8. Sentinel-1 data are not affected by cloud cover in the way optical data are, but in our results Sentinel-1 is generally not competitive with the other two satellites.

Conclusions
In this paper, we model field-scale crop yields in Argentina by deriving features from three moderate-resolution sensors and applying multiple machine learning models to the data. We see potential in combining mixed-effect approaches with nonparametric models, as they are able to capture the regional variability in crop yields. Our approach is crop agnostic and highly scalable at low cost, since it uses freely available global data, and it can be applied to small-scale intercropped farming systems as well as large monoculture farms. In future work, we plan to improve model performance by combining SAR and optical data, to operationalize the system to produce in-season crop yield forecasts for Argentinian farmers, and to assess the portability of our approach to other crops and countries in Latin America.