High resolution, annual maps of the characteristics of smallholder-dominated croplands at national scales

Understanding agricultural change requires reliable, frequently updated maps that describe the characteristics of croplands. Such data are often unavailable for regions dominated by smallholder agricultural systems, which are particularly challenging for remote sensing. To overcome these challenges, we designed a system to minimize several sources of error that arise when mapping smallholder croplands. To overcome errors caused by mismatches between image resolution and cropland scales, as well as persistent cloud cover, the system converts daily, 3.7 m PlanetScope imagery into two seasonal composites within a single agricultural year. To reduce errors that occur when training classifiers, we built a labelling platform that rigorously assesses label accuracy and creates more accurate consensus labels that train a Random Forests model. The labelling platform and model interact within an active learning process that boosts the accuracy of the resulting cropland probability map, which is used in a segmentation process to delineate individual field boundaries. We applied this system to map Ghana's croplands for the year 2018. We divided Ghana into 16 mapping regions (12,160-23,535 km²), training separate models for each using a total of 6,299 labels, plus 1,600 for validation. Using an independent map reference sample (n=1,207), we found that the overall accuracies of the resulting cropland probability and field boundary maps were 88% and 86.7%, respectively, with Producer's accuracies for the cropland class of 61.2% and 78.9%, and User's accuracies of 67.3% and 58.2%. Croplands covered 16.1-23.2% of the mapped area, comprising 1,131,146 total fields with an average size of 3.92 ha. Estimates based on the map reference sample indicate the cropland percentage is 17.1% (15.4-18.9%) or 17.6% (15.6-19.6%), depending on the map used to estimate the standard error.
Using the labellers' digitized field boundaries to estimate biases in field boundary statistics, we calculated an adjusted mean field size of 1.73 ha and a total field count of 1,662,281. Although the cropland class contained substantial errors, the system was effective in mitigating error and quantifying the resulting performance gains. By minimizing training errors, consensus labelling improved the model's F1 scores by up to 25%, while three iterations of active learning increased the F1 score by 9.1% on average, which was 2.3% higher than training models with randomly selected labels. Map accuracy could be further improved by replacing Random Forests with a convolutional neural network. These results demonstrate a readily adapted, transferable framework for developing high resolution, annual, nation-scale maps that provide important details about smallholder-dominated croplands.


Equation 1 assigns lower weights to hazy and clouded pixels, as the blue band is sensitive to haze and cloud (Zhang et al. 2002), while Equation 2 assigns low weights to pixels in cloud shadow, and a weight of 1 otherwise, given the significant darkening effect of cloud shadows in the near-infrared band (Zhu and Woodcock 2012, Qiu et al. 2020). In these equations, t is a particular date in the near-daily time series of PlanetScope images, which begins at date 1 for the given compositing period and ends on date i; blue is the blue band and NIR the near-infrared band. Once these two weights are calculated, the final composited pixel value for each of the four PlanetScope bands is the weighted mean:

B_composite = Σ_t (w_1,t × w_2,t × B_t) / Σ_t (w_1,t × w_2,t)    (3)

calculated for each pixel and each band B over the particular compositing period. The composited tiles were then added to the S3 store (Figure 1), where they are stored as cloud-optimized GeoTIFFs, and a "slippy map" rendering is created for each composite using Raster Foundry (Azavea 2020). The web-rendered imagery is presented within the training data platform (next section).

Training and reference data are collected by a custom labelling platform, which was originally designed for Amazon's Mechanical Turk job marketplace (Estes et al. 2016a). The basic structure of the system remains the same, but we converted it into a standalone platform that allows us to enroll and pay people directly for their labelling, and is designed to control and supervise the machine learning process.
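A minimal sketch of the compositing step described above. Because the exact forms of Equations 1 and 2 are not reproduced here, the inverse-squared blue weight, the NIR shadow threshold, the 0.1 shadow weight, and the `seasonal_composite` name are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def seasonal_composite(blue, nir, band, shadow_thresh=0.15):
    """Composite a (time, y, x) stack of one band into a (y, x) image.

    Illustrative stand-ins for the paper's weights: Eq. 1 is approximated
    by an inverse-squared blue reflectance (hazy/cloudy pixels are bright
    in blue, so they get low weight); Eq. 2 by a piecewise NIR rule that
    down-weights dark, likely-shadowed pixels and is 1 otherwise. The
    composite (Eq. 3) is the weighted mean across dates.
    """
    w1 = 1.0 / (blue + 1e-6) ** 2                  # haze/cloud weight (placeholder)
    w2 = np.where(nir < shadow_thresh, 0.1, 1.0)   # shadow weight (placeholder)
    w = w1 * w2
    return (w * band).sum(axis=0) / w.sum(axis=0)  # Eq. 3: weighted mean
```

With these placeholder weights, a clear-sky date dominates the composite over a hazy one, which is the behaviour the weighting scheme is designed to produce.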

The platform runs on a Linux virtual machine hosted on an AWS EC2 instance and is comprised of a database (PostGIS/Postgres), a mapping interface (OpenLayers 3), an image server (Raster Foundry), and a set of utilities for managing, assessing, and converting digitization work into rasterized labels for training a machine learning algorithm. Each instance of the platform focuses on a specific AOI (Figure 2A). The following sections provide an overview of the labelling platform's architecture. Selected grid cells are divided into a training and validation sample, according to predetermined proportions. The grid, which is nested within the tiling and larger 1-degree grids (Figure 2C), defines the spatial unit for a labelling job.

The selected cells are placed into a queue within the platform's database, and then converted into mapping tasks, each with a specified number of assignments (boundaries drawn by an individual labeller) that must be completed before the task is complete.
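The queueing logic just described can be illustrated with a small sketch. The `Task` and `build_queue` names, the validation fraction, and the default of four assignments per task are assumptions based on the surrounding text, not the platform's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One labelling task for a grid cell, completed once enough
    independent labellers have drawn boundaries for it."""
    cell_id: str
    kind: str                  # "training" or "validation"
    n_assignments: int = 4     # assignments required before completion
    completed_by: list = field(default_factory=list)

    def is_complete(self) -> bool:
        return len(self.completed_by) >= self.n_assignments

def build_queue(cell_ids, val_fraction=0.2, n_assignments=4):
    """Split selected grid cells into training/validation tasks according
    to a predetermined proportion (a stand-in for the database queue)."""
    n_val = int(len(cell_ids) * val_fraction)
    tasks = []
    for i, cid in enumerate(cell_ids):
        kind = "validation" if i < n_val else "training"
        tasks.append(Task(cid, kind, n_assignments))
    return tasks
```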
Where i indicates the particular assignment, and β_0-4 represent varying weights that sum to 1.

Over time, labellers are assessed multiple times across a range of accuracy tasks, which are selected to represent the variability of the agricultural system being mapped. Each labeller's score history is averaged to provide an overall accuracy measure, and this information is used for creating labels, the second task.
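The weighted scoring and score-history averaging can be sketched as follows. The five quality components and the equal weights are hypothetical placeholders (the actual components are defined in Estes et al. 2016a); only the structure, a weighted sum with weights summing to 1, followed by averaging across assignments, reflects the text:

```python
def assignment_score(components, betas):
    """Score for one accuracy assignment: a weighted sum of quality
    components (hypothetical here), with weights beta_0..beta_4
    constrained to sum to 1."""
    assert len(components) == len(betas)
    assert abs(sum(betas) - 1.0) < 1e-6
    return sum(b * c for b, c in zip(betas, components))

def overall_accuracy(score_history):
    """A labeller's overall accuracy measure: the mean of their
    assignment scores over time."""
    return sum(score_history) / len(score_history)
```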

If the labeller's completed assignment was a training/validation task, their maps remain stored in the database until the task's outstanding assignments are completed by other labellers. Once complete, another routine is invoked, which combines the task's completed assignments into a single consensus label using a Bayesian merging approach, where θ represents the true cover type of a pixel (field or not field), D is the label assigned to that pixel by a labeller, and W_i is an individual labeller. P(θ|D) is therefore the probability that the actual cover type is what the labellers who mapped it say it is, P(W_i|D) is an individual labeller's average score over all the accuracy assessment assignments they have completed, and P(θ|D, W_i) is the labeller's label for that pixel. This approach therefore uses the overall accuracy of each labeller to weight their labels when combining them with those made by other labellers for the same pixel (see SI for further details). As a further measure of confidence in the final consensus label, its average Bayesian Risk can be calculated (see SI). This measure ranges between 0 and 1, with 0 indicating full agreement between labellers for all pixels (n = 40,000) in the label, and 1 indicating complete disagreement.

Features were computed with an 11×11 and a 5×5 moving window, respectively (initial tests revealed these two window sizes to be most effective), resulting in 24 overall features, including the original bands (Table 1).
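The Bayesian merging step can be illustrated with a simplified per-pixel sketch. It assumes each labeller reports the true class with probability equal to their overall accuracy score, and uses one simple definition of per-pixel risk; the paper's exact likelihood terms and risk formula are given in its SI, so both functions are approximations:

```python
def consensus_probability(labels, accuracies, prior=0.5):
    """Merge one pixel's labels (1 = field, 0 = not field) from several
    labellers into P(theta = field | labels).

    Simplifying assumption: a labeller with overall accuracy score a
    reports the true class with probability a, so accurate labellers
    pull the posterior harder than inaccurate ones.
    """
    p_field, p_not = prior, 1.0 - prior
    for label, acc in zip(labels, accuracies):
        p_field *= acc if label == 1 else 1.0 - acc
        p_not *= 1.0 - acc if label == 1 else acc
    return p_field / (p_field + p_not)

def pixel_risk(p_field):
    """One simple risk measure: 0 when the merged label is certain
    (p near 0 or 1, i.e. labellers agree), 1 at maximum disagreement
    (p = 0.5). Averaging over all pixels in a label mirrors the 0-1
    risk measure described above."""
    return 2.0 * min(p_field, 1.0 - p_field)
```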

After the final iteration, the segmentation algorithm is invoked, which entails several steps.
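The paper's segmentation steps are not detailed in this excerpt. As a generic illustration only (not the authors' algorithm), one common starting point is to threshold the cropland probability map and label connected components as candidate fields:

```python
from collections import deque
import numpy as np

def segment_fields(prob, threshold=0.5):
    """Generic field-delineation sketch: threshold a 2-D cropland
    probability map, then label 4-connected components of the resulting
    mask as candidate fields. Returns (label image, field count)."""
    mask = prob >= threshold
    labels = np.zeros(mask.shape, dtype=int)
    n_fields = 0
    for y, x in zip(*np.nonzero(mask)):
        if labels[y, x]:
            continue                      # already assigned to a field
        n_fields += 1
        labels[y, x] = n_fields
        queue = deque([(y, x)])
        while queue:                      # flood-fill this component
            cy, cx = queue.popleft()
            for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                           (cy, cx - 1), (cy, cx + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = n_fields
                    queue.append((ny, nx))
    return labels, n_fields
```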

To assess the accuracy of the final segmented boundaries, we used a two-step approach. First, we assessed the overall thematic accuracy of the resulting classification against our map reference data.

Second, to assess the quality of the segmentation, we compared the mean area and relative frequencies of field sizes against the boundaries digitized by the labellers.

Parts of the country are dominated by tree crops, where cocoa and oil palm are among the dominant crops. For these regions, we did not attempt to classify tree crops, and instead mapped clearings that potentially contain field crops or newly felled or recently replanted tree crops. We made this decision because PlanetScope's resolution is not high enough for labellers to distinguish many tree crops from surrounding forest, and the boundaries of many tree crops (e.g. cocoa) are often not visible.

To create the cropland maps, we divided the country into 16 distinct AOIs, which were developed by

To evaluate the performance of the system, we performed several analyses, described in sections 2.3.1-2.3.4.

2.3.1 Image quality

We evaluated the overall quality of the resulting seasonal image composites by assessing a random selection of 50 tiles. We graded both seasonal composites for each tile using a four-category quality score, which evaluated the degree of 1) residual cloud and 2) cloud shadow, 3) the number of visible scene boundary artifacts, and 4) the proportion of the image that had its resolution degraded below the typical 3-4 m PlanetScope resolution (e.g. because of between-date image mis-registrations). Each category was qualitatively ranked from 0-3, with 0 being the lowest quality and 3 the highest (see SI for the complete protocol), making the highest possible score 12. We rescaled scores to fall between 0 and 1.
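The rescaling just described is a simple sum over the four 0-3 grades divided by the maximum score of 12 (the function name below is illustrative):

```python
def composite_quality(cloud, shadow, boundary_artifacts, resolution):
    """Rescale the four 0-3 category grades (3 = highest quality) to a
    0-1 composite quality score; the raw maximum is 4 * 3 = 12."""
    grades = (cloud, shadow, boundary_artifacts, resolution)
    assert all(0 <= g <= 3 for g in grades), "grades must be in 0-3"
    return sum(grades) / 12.0
```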

2.3.2 Model gains per iteration
To assess the gain in model performance due to active learning, we measured the change in accuracy, F1, and AUC (see 2.2.3.2) between each iteration and between the first and last iterations for each AOI.

To evaluate whether active learning improved model performance relative to a purely random approach to selecting new training sites, we ran additional tests within a subset of AOIs (1, 8, and 15).

The AOIs followed features such as the boundaries between agroecozones (see Figure S4A). The total numbers of unique training and validation sites across the country were 6,299 and 1,600, respectively.
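The excerpt does not give the exact criterion used to choose new training sites during active learning. A common approach, shown here purely as an assumed illustration of the active-versus-random comparison, is uncertainty sampling: pick the sites where the current model's cropland probability is closest to 0.5:

```python
import random

def select_uncertain_sites(site_probs, n=10):
    """Active-learning sketch: rank candidate sites by the uncertainty of
    the model's cropland probability (distance from 0.5) and return the
    n most uncertain site ids."""
    ranked = sorted(site_probs, key=lambda s: abs(site_probs[s] - 0.5))
    return ranked[:n]

def select_random_sites(site_probs, n=10, seed=0):
    """Baseline for comparison: choose n sites uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(list(site_probs), n)
```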

The training and validation sample collection effort was divided across 20 labellers, with a core group of 13 who mapped more than 1,000 sites each. As each training site was mapped by 4 separate labellers, 34,014 sets of vector labels were made, and each labeller digitized an average of 2,001 (see Figure S5A for more details on labelling effort). Labeller accuracy was scored 9,389 times; the average accuracy score was 0.71 (range 0.6 to 0.85; see Figure S5B for detailed score distributions).
After each site was mapped by four labellers, consensus labels were generated. The Bayesian Risk (see

The impact of training data error
The potential impact of label errors on map quality was assessed in four AOIs (1, 2, 8, and 15). The results of these tests showed that the average accuracy, AUC, and F1 scores for models trained with the consensus labels were, respectively, 0.772, 0.8, and 0.555 (Figure 6). Performance metrics from consensus-trained models were just 0.5-1.2 percent higher than those of models trained with the most accurate individual labels.

A second measure of the impact of label error is found within the correlations between the mean label risk per AOI and the model performance metrics (Table S3). Accuracy and AUC had strong correlations with mean label risk.

We used a map reference sample of 1,207 sites (487 cropland; 720 non-cropland) to evaluate the accuracy of the per-pixel classifications (resulting from thresholding the Random Forests probability), as well as the segmented field boundary maps. We first evaluated the uncertainty in the map reference classes by assessing 1) the overall agreement between map reference labels collected by two separate supervisors at 23 sites, and 2) the confidence of the labels assigned by the supervisors (see SI for details). The first measure showed that the two supervisors' labels agreed at 87% of common sites, while the second showed that 15.7% of sites were labelled with the two classes that indicated a level of uncertainty.

Figure 6: Scores for overall accuracy, area under the curve of the Receiver Operating Characteristic, and the F1 score resulting from models trained with consensus labels, and labels made by the most and least accurate labellers to map each site. Comparisons were made for AOIs 1, 2, 8, and 15, denoted by grey symbols, while the mean scores across these AOIs are shown for each metric.
We found that the overall accuracy of the pixel-wise classifications was 88% against this map reference sample (Table 2). Confining the map reference sample to four distinct zones (Figure S10A) shows that overall accuracy ranged from 83.3% in Zone 1 (AOIs 1-3) to 93.6% in Zone 3 (AOIs 10, 11, 13, 15, and 16). The Producer's accuracy of the cropland class was 61.7% across Ghana, ranging from 45.6% in Zone 3 to 67.9% in Zone 1, while the User's accuracy was 67.3% overall, ranging from 59.8% in Zone 4 to 71.2% in Zone 1. Both measures of accuracy were substantially higher for the non-cropland class across all zones, typically exceeding 90%. The lowest accuracies for the non-cropland class were in Zone 1 (Producer's = 89.3%; User's = 87.7%).
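The accuracy measures reported here follow the standard confusion-matrix definitions, which can be sketched as follows (the function name and example counts are illustrative, not the paper's data):

```python
def accuracy_metrics(tp, fp, fn, tn):
    """Overall, User's, and Producer's accuracy for the cropland class
    from a 2x2 confusion matrix of reference-sample counts:
      tp = mapped cropland, reference cropland
      fp = mapped cropland, reference non-cropland (commission error)
      fn = mapped non-cropland, reference cropland (omission error)
      tn = mapped non-cropland, reference non-cropland."""
    total = tp + fp + fn + tn
    return {
        "overall": (tp + tn) / total,
        # of sites mapped as cropland, the fraction that truly are
        "users_cropland": tp / (tp + fp),
        # of true cropland sites, the fraction the map detected
        "producers_cropland": tp / (tp + fn),
    }
```

User's accuracy is the complement of commission error and Producer's accuracy the complement of omission error, which is why segmentation's lower User's and higher Producer's accuracies correspond to more commission and less omission.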

The overall accuracies obtained from the segmented maps were generally 1-2 percentage points lower than those of the per-pixel maps, while User's accuracies tended to be 8-10 percentage points less (Table 2). In contrast, Producer's accuracies were 15-20 points higher than in the per-pixel map. The segmentation step therefore helped to reduce omission error while substantially increasing commission error.

Overall accuracy was high when measured against the map reference sample, our best estimate of the "truth," in which we have roughly 85% confidence. However, accuracies for the cropland class were much lower, falling between 62% (Producer's) and 67% (User's) country-wide for the per-pixel map (Table 2), meaning the model produced substantial commission and omission errors for this class. The segmented boundary maps had fewer omission errors (Producer's accuracy = 79%), but higher false positive errors (User's accuracy = 58%), with the exception of Zone 3, which had the highest User's accuracy, albeit from a very small sample. Aligning the reference samples more precisely with agroecozone boundaries (Figure S10B) provides further insight into error patterns (Table S4), such as confusion between croplands and natural vegetation with sparse herbaceous cover, which was common in many AOIs.

The inherent difficulty of the labelling task was another major limiting factor. Our system was designed to minimize the error inherent in labelling, but distinguishing croplands from non-croplands in these agricultural systems can be a difficult task. Labellers have to evaluate multiple image sources and rely heavily on judgement, which inevitably leads to errors. Interpretation is particularly hard where the background savanna vegetation and croplands have similar reflectance during the dry season, which is a particular problem in AOIs 2 and 3. Smaller field sizes also complicate labelling, as these become increasingly indistinct in the ~4 m PlanetScope composites. The difficulty of labelling is reflected in the magnitude of the Bayesian Risk metrics (Figure S6), and by the average score achieved by each labeller.

The age of available imagery is one reason why we chose not to label on basemap imagery (in addition to restrictive usage terms), as basemaps are typically several years old (Lesiv et al. 2018). However, we did not assess whether the higher label accuracy one might achieve by digitizing on a <1-2 m resolution basemap would offset model errors caused by temporal mismatches.

Another potential issue is the degree to which our assessment of the impact of label error on model performance (Figure 6) was influenced by the validation labels we used, which were generated using the consensus method. This could have confounded the assessment, particularly when comparing models trained with the most accurate individual labels and those trained with consensus labels. However, a visual assessment of the resulting probability maps confirms the differences in scores: consensus and most-accurate individual labels produced nearly identical maps with relatively high certainty, while low-quality labels led to a markedly less certain map (Figure S9).

The maps presented here represent a version 1 product that is freely available to use, along with its underlying code (see SI for details). These data were developed according to recommended best practices.

To facilitate the next step, generating more accurate version 2 maps, several improvements will be made. The first is to replace Random Forests with a more advanced convolutional neural network.