Application of Probabilistic Machine Learning to the prediction of Remotely Sensed Vegetation Health

Every prediction of the future carries some level of uncertainty; making this explicit is challenging. This paper introduces a Probabilistic Machine Learning algorithm, namely the Natural Gradient Boosting algorithm, as a modelling tool for predicting the Vegetation Health Index, a proxy for monitoring vegetation stress in response to changing weather conditions. It then elaborates on the time heterogeneity of the probabilistic predictions generated by the model over one year on West Java, the most populated Indonesia region, providing important consideration for Food Security. The results highlight that while 94% of the observed values fall within the model’s 95% prediction interval, during February and March, the probability of reaching critically low levels of VHI below 40 is above 75%. The paper foster the multidisciplinary literature linking the use of remotely sensed vegetation health data with Food Security by incorporating uncertainty in predicting the impact of changing weather conditions on Food Security.


Introduction
The relation between weather and poor harvests has ceaselessly been a concern for many actors and stockholders in the field of agricultureKogan et al. (2005). The literature on food security has highlighted how climate change has a substantial impact on food supply and accessibility which in turn affect human health, food production and distribution channels (Anderson et al., 2020;Mbow et al., 2020). On the biological level, climate change has been shown to hindering plant growth and health (Mora et al., 2015), and the impact on global crop yields has been already estimated to be equivalent to a decrease ranging from 3% to 10% per degree of warming above historical levels (Porter et al., 2014). In this scenario, it has become of the utmost importance to monitor and provide reliable predictions of the future condition of vegetation health to allocate scarcer resources better and ensure food security.
On this meter, Remote Sensing enables collecting a variety of information over large spatial areas and on different resolution scales. The scientific community has made extensive use of satellite imagery for mapping and monitoring changes in land cover and estimating geophysical and biophysical characteristics of the soil. By characterising natural features on the ground and monitoring their changes over time, remote sensing applications have supported policy-making on different levels (Sishodia et al., 2020).
High-resolution satellite data have enabled tracking regional patterns and rates of forest utilisation and distribution. In the last decade, much attention has been dedicated to understanding the climate system and its changes, especially due to climate change. Here, satellite imagery has provided significant advances to investigate climate change-induced processes in the long and short term through the development of global climate models (Bosetti et al., 2006), Climate Extreme Indices (Mistry, 2019) and Vegetation Health Indices (Kogan, 2002;Kogan et al., 2004;Kogan, 2019).
The field's technical and methodological advances have brought a multitude of indexes to monitor several vegetation aspects (Xue and Su, 2017). The Vegetation Health Index (Kogan, 1997(Kogan, , 1987 is one such index that characterises vegetation health and has been often used to estimate crop condition and predict yield (Kogan, 1990;Orlovsky et al., 2010;Rahman et al., 2009;Zuhro et al., 2020).
On the other hand, Machine Learning algorithms have opened up new frontiers and enabled the development of new tools for analysing satellite imagery, providing better and more nuanced insights.
Scientists have used Machine Learning and Satellite Imagery to monitor tropical forest carbon stocks (Csillik et al., 2019), to identify and classify the direct drivers of deforestation  and to estimate the effect of public policies on agricultural productivity (Hammad et al., 2021). Machine Learning algorithms provide unprecedented and promising results thanks to their ability to find patterns underlying the complex nonlinear relations that characterised environmental variables. Remote Sensing and Machine Learning has advanced our knowledge of the nexus between climate change and vegetation health and provided accurate predictions of future scenarios. Nevertheless, Probabilistic Machine Learning has highlighted the importance of accompanying any statement about the future with a quantification of its uncertainty to give a complete picture of the possible scenarios and better inform decision-making by land managers, farmers, and policymakers (Duan et al., 2020;Ghahramani, 2015;Gneiting and Katzfuss, 2014;Palmer, 2012).
The aim of this paper is twofold. Firstly is to provide a review of the methodological background of Probabilistic Machine Learning and Remotely Sensed Vegetation Health. Secondly, to practically demonstrate the use of Probabilistic Machine Learning to predict one-month ahead the Vegetation Health Index over West Java, the most populated region of Indonesia, and explore the time heterogeneity of the probabilistic predictions.
The paper foster the multidisciplinary literature linking Food Security with the use of remotely sensed vegetation health data and the latest advances in the field of Probabilistic Machine Learning, incorporating uncertainty in the prediction of the impact of changing weather conditions on Food Security.
The remainder is structured as follows: Section 2 provides an overview of the theory behind the 2 measurement of Vegetation Health through remote sensing and the methodological background of Probabilistic Machine Learning with an emphasis on the Natural Gradient Boosting algorithm that will be used in Section 3. This section describes the study area, the data and the statistical modelling routine.
Section 4 presents the analysis results, and Section 5 elaborates on the strengths and limitations of the analysis.

Vegetation Health
The literature on the assessment of agricultural productivity has taken great advantage of satellite technologies, and several indices have been developed to characterise land cover with a specific focus on vegetation health. As defined by Kogan (2019, p.52): Vegetation health is the method designed to derive vegetation condition (favorable, unfavorable, normal, etc.) or health (healthy, unhealthy, stressed,etc.) in response to changing weather (precipitation, temperature, and others) in each ecosystem and climate.
Remote sensing instruments detect light within the electromagnetic spectrum of an object when it is illuminated. Depending on the source of energy used to illuminate the object, there are two ways to collect remote data. Actively, that is, a sensor device emits a signal on the object of interest, and the sensor captures its reflection. Passively, when the natural sunlight reflection of the object is used (Woodhouse, 2017). An import implication for measuring vegetation health using remote sensing instruments is that plants absorb most of the light as part of photosynthesis. Specifically, vegetation absorbs the secondhighest amount of solar energy in the visible range of the solar spectrum, which turns to increase the photosynthetic rate and green mass accumulation. Hence healthy plants reflect more near-infrared (NIR) than visible red (RED). When a plant becomes unhealthy, it reflects more RED and less of the NIR. This finding brought to the introduction of the Normalized Difference Vegetation Index (NDVI) measured as the ratio of the difference of the red RED and NIR radiance over their sum as in equation 1 (Kriegler et al., 1969).
Based on the NDVI, Kogan (1987Kogan ( , 1990 proposed three alternative indexes. to 100 (very unhealthy), with values from 40 to 0 indicating increased vegetation stress and 60 to 100 suggesting better vegetation health conditions.
The three Vegetation Health indexes developed by Kogan  The findings in the literature highlight that VHI correlates well with meteorological drought and agricultural drought in monsoonal rainfall areas (Ma'rufah et al., 2017) and, most importantly to be useful in predicting the yield of several grain crops such as corn in China (Kogan et al., 2005), wheat in the USA (Salazar et al., 2007), and rice in Bangladesh (Rahman et al., 2009) several months in advance of the harvest with considerable implication for advanced projections of food insecurity scenarios.
On the other hand, VHI was found to perform poorly in high latitude regions due to the assumption of inverse correlation between NDVI and BT that is not met in northern ecosystems where increasing temperatures support plants growth (Karnieli et al., 2006).

Probabilistic Machine Learning
Uncertainty and its quantification are fundamental aspects of any environmental analysis, and they are now crucial aspects in the face of climate change and its complicated consequences. Epistemic uncertainty (caused by limited data and knowledge) and aleatory uncertainty (randomness or variability in the underlying variables) are virtually present in all projections of a future phenomenon (Helton et al., 2010). In the last decade, researchers in the field of Probabilistic Machine Learning have developed more sophisticated algorithms that provide reliable predictions and account for the inherent uncertainties of Machine Learning algorithms (Duan et al., 2020;Ghahramani, 2015;Gneiting and Katzfuss, 2014;Kabir et al., 2018). By shifting from a pointwise estimation to a distribution estimation, we explicitly integrate the notion of uncertainty in our prediction while providing more comprehensive information to the enduser.
A typical Machine Learning algorithm, produce a point estimate as in equation 5 where x indicates the set of covariates and y is the target variable. Instead, a probabilistic model estimates the conditional probability distribution as equation 6 The output is a full probability distribution over the entire outcome space that capture the potential values that the outcome could take, or in other words, a probabilistic model outputs a prediction interval where its width varies depending on the probable values of the target. The algorithm is made of three components, namely the Base Learner (g), the assumed Parametric Distribution of the outcome of interest conditional on the covariates (P θ ), and the Scoring Rule (or Loss Function). As in other implementation of GBM, the set of covariates is passed to the chosen base learners (typically Decision Trees) but to train the model to predict the parameters of the chosen distribution (e.g. Normal) rather than predict the outcome as a point estimate. In this way, it is possible to produce a probability distribution of the predictions P(y|x) over the outcome space. More specifically, each parameter θ of a predicted conditional distribution P θ are estimated through an additive combination of L base learners separately and scaled with stage-specific scaling factors ρ (l) and common learning rate η (equation 7). Assuming that P θ correspond to a normal distribution, we have g (l) µ and g (l) logσ .
The estimated probability distribution is compared with the observed data, and a score is assigned to the forecast. The typical scoring rule used in a probabilistic regression model is based on the average negative log-likelihood (NLL). However, the authors also provide the Continuous Ranked Probability 5 Score (CRPS), a generalisation of the Mean Absolute Error (MSE), to a probabilistic model.
The key novelty of NGBoost is to overcome the inflexibility and constraints of existing probabilistic algorithms such as Bayesian Additive Regression Trees (BART) (Chipman et al., 2010) with the use of Natural Gradient. Natural Gradient features the use of the Fisher Matrix to scale the parameters updates following a log-likelihood function. Compared with the gradient used in other GBM implementations that is a first-order optimisation method, the Natural Gradient has the property of constraining the output distribution in each update step (Amari, 1998).

Study area
West Java is the most populated province of the Indonesian Archipelago, with a population of Turning to food security, West Java has already experienced extensive losses in rice production due to droughts, floods or pests. Between January to September 2020, more than 12 thousand ha of paddy fields were affected by these hazards (WFP, 2021). The number of individuals below the poverty line is estimated to be 3,920.2 thousand (BPS, 2020b), which corresponds to a Poverty Severity Index of 0.23 (BPS, 2020c). Child stunting ( 35.3%) due to malnutrition and poor access to food is also a significant issue related to food security (UNICEF, 2018).
6 Figure 1: Map of the West Java and its geographical location

Data sources and processing
The analysis is based on a combination of remotely sensed and atmospheric data detailed in Table 1 from the first week of May 2000 to the last week of January 2021. The primary outcome of the analysis is the VHI over cropland produced from the National Environmental Satellite, Data, and Information (NESDIS) part of the National Oceanic and Atmospheric Administration (NOAA). The data is available from 1981 with 4 km spatial and 7-day composite temporal resolution. To capture the long-term contribution of climate on vegetation health, yearly means of precipitation and temperature are included as covariates. For the short-term weather contribution, monthly rainfall, temperature, humidity, solar radiation means, and the total amount of rain by month are used. Raw covariates data are masked over cropland pixels on the Google Earth Engine platform (Gorelick et al., 2017) using the GFSAD1000 gridded cropland product (Teluguntla et al., 2015); then, the mean value of each variable is extracted for each day of the study period over West Java.
The masked data are than imported into the R scientific programming environment, where the data.table (Dowle et al., 2019) library is used to combine the data and apply additional processing to generate a set of features that could enhance the accuracy of the model for a one-month ahead prediction and capture trends and seasonality in the time series. This includes lagged feature of each variable, rolling sum of rain and rolling mean of rain, temperature, humidity and solar radiation over the previous month.
Year, month, week of the year, and sine and cosine function of time (for week, month, and 7 year), are included as additional predictors. Finally, the chained random forest algorithm (Stekhoven and Buhlmann, 2011) of the package MissRanger (Mayer, 2019) is used to impute the missing data (3%).

Statistical modelling
To provide a probabilistic prediction of a given variable, the first step is to choose the parametric probability distribution of Y conditional on X. A first exploration of the outcome variable can give a good idea of the distribution followed by a statistical test (Hohberg et al., 2020). The latter is carried with the use of the gamlss function from the gamlss library (Stasinopoulos et al., 2007) to test how well a given conditional distribution fits the data based on the Quantile Residuals diagnostic fit of mean, variance, skewness, kurtosis, and the Filliben correlation coefficient test (Filliben, 1975) which measure the correlation between theoretical and sample quantiles of the conditional distribution of interest.
Using the data mentioned above and the chosen parametric distribution, the NGBoost algorithm (Duan et al., 2020) (as implemented in Python) with Regression Tree as the base learner and the Log-Score as the Scoring Rule, is trained on the first 19 years of weekly data (from the first week of May 2000 to the last week of January 2020). Tenfold cross-validation with a random search and 200 iterations is performed to find the best hyper-parameters (Wright and Ziegler, 2015) evaluated with the root mean square error (RMSE). Finally, the last year of data (from the first week of February 2020 to the last week of January 2021) is used as a test-set and to further explore the time heterogeneity of the distribution of the probabilistic prediction.

Choice of the Parametric probability distribution
As discussed in the previous paragraph, a first step to get an idea of the form of probability distribution to be used to model the data can come from a simple plot of the distribution of the outcome variable. Figure 2 depicts the distribution of VHI for the whole study period, which resembles a normal distribution. 8 To test whether the normal distribution is a good candidate, additional test are carried using gamlss.
The set of covariates detailed in the previous section is related to each of the normal distribution parameters, that is, the mean (µ) and standard deviation (σ) of the outcome variable. Figure 3 and Table 2, show that the q-q plot and statistics suggest that the normal distribution fit well the conditional distribution of VHI on the covariates. Specifically, the q-q plot does not show remarkable departures from a standard normal distribution, and the statistical test of the quantile residuals highlights a high, very close to one, Filliben correlation coefficient. Skewness and Kurtosis are close to 0 and 3, respectively. Additionally, the mean and the variance are very close to the expected values of 0 and 1, respectively.      (Table 4). This highlights the ability of the model to correctly predict the range of possible values that the outcome could take at each single predicted point within a 95% probability interval.

Time heterogeneity of the predicted values
Using density ridgeline plots, I explore changes in the VHI probability distributions over the months from February 2020 to January 2021. From the density plot in figure 5, we can see that the range of possible values of VHI goes from around 30 to 55, with most of the months falling in the interval 40 to 50.
February and March are the months with critically low level below 40, indicating increased vegetation stress. Between June to November, the VHI is expected to fall in a range indicating good vegetation condition (45 to 55) with not much variation. indicating stressed vegetation. February and March 2020 have a probability equal to 1 to fall below both the two thresholds but the chance of having a VHI below 40 drops down to zero from the 10th week and increases again at the end of the study period, which correspond to January 2021. The probability of having the VHI below 50 ranges from 0.5 to 0.9 for most of the year.

Discussion and conclusions
In a scenario where climate change, population growth and global diseases threaten economic access to food and increase the chances of malnutrition, it is fundamental to provide reliable predictions of vegetation health condition to inform food security plans and policy-making better. This work takes a step forwards, shading light into how to mitigate the uncertainty that accompanies every statement about the future. Using a novel Machine Learning algorithm and 20 years of weekly high-resolution satellite data, the analysis shows the ability of the model to learn well the historical patterns of vegetation health levels as measured by the VHI and output a prediction interval that captures 94% of the observed values.
By exploring the probabilistic distribution and the time heterogeneity of the output, February and March are found to be characterised by a probability of reaching critically low levels of VHI below 40, above 75%.
The limitations of the analysis and its findings can be found in the potential error in remotely sensed variables used in the analysis and generally in the data granularity. The analysis could be extended to other regions of Indonesia and other developing countries where food security is at risk. Overall the analysis contributes to the growing literature focused on using remotely sensed data and machine learning algorithm to improve food security.

Code and data availability
The R, Python and Google Earth Engine JavaScript code to replicate the analysis and the figures are publicly hosted at https://github.com/athammad/probVHI. The repository includes references to retrieve the input data.