Evaluation and comparison of a machine learning cloud identification algorithm for the SLSTR in polar regions

A Feed Forward Neural Net (NN) approach to distinguishing between clouds and the surface has been applied to the Sea and Land Surface Temperature Radiometer (SLSTR) in polar regions. The masking algorithm covers the Arctic, the Antarctic and regions typically classified as the cryosphere, such as northern hemisphere permafrost. The mask has been trained using collocations with the CALIOP active lidar, which in narrow strips provides more accurate detection of cloud, and was subsequently evaluated as a function of cloud type and surface type. The mask was compared with the existing operational Bayesian and Empirical cloud masks both by eye and statistically using CALIOP data, and was found to perform exceptionally well in the polar regions. The Kuiper skill score improved from 0.28 for the operational Bayesian mask and 0.17 for the Empirical mask to 0.77 for the NN. The NN algorithm also has a much more homogeneous performance over all surface types. The key improvement came from better identification of clear scenes: for the NN mask, the same performance in terms of contamination of cloudy pixels in the sample of identified clear pixels can be achieved while retaining 40% of the clear pixels, compared with 10% for the operational cloud identification. The algorithm performed with almost the same skill over sea and land. The best performance was achieved for opaque clouds, while transparent and broken clouds showed slightly reduced accuracy.


Introduction
Clouds play an important role in moderating the solar radiation incident on Earth and in regulating the amount of radiation emitted back to space. The balance of radiation reflected and emitted depends critically on the coverage, temperature and albedo of the cloud. How these cloud properties will change in a warming climate is still highly uncertain, as outlined in the latest IPCC report, Stocker et al. (2013).
The polar regions are particularly important moderators of the Earth's global radiation balance, both directly and indirectly through circulation changes and global teleconnections. Over the course of the year, heat moves away from the equator into the polar regions and escapes through the atmosphere; the polar regions typically give off more heat than they absorb.
Changes to the global climate can be amplified over the polar regions, and there is evidence that the polar regions are changing faster than other regions. In the IPCC Special Report on the Ocean and Cryosphere in a Changing Climate, Pörtner et al. (2019), the Arctic sea ice extent was shown to have decreased significantly in the past few decades. In the Antarctic, the changes are more uncertain, with the sea ice extent increasing until 2014 and then more recently decreasing at rates higher than those observed in the Arctic, as described in Parkinson (2019). The impact of clouds on the radiative forcing in the polar regions is complex and can result in either positive or negative feedbacks, depending on how the amount and type of clouds change in response to global warming and sea ice loss, as outlined in Goosse et al. (2018) and Huang et al. (2019).
The permafrost region, as defined in Obu et al. (2019), is a large carbon store sensitive to climate change; the effect of thawing permafrost on the formation of clouds in the region, and any subsequent feedback, is also currently uncertain. Further studies of the correlations between clouds and processes that have a large impact on the climate would benefit from a cloud mask that performs well and uniformly across all surface types.

Cloud masking is essential in its own right for monitoring trends in cloud coverage and properties. In addition, it is an important first step in developing satellite retrieval algorithms for surface properties or atmospheric variables.
Whether the aim is to observe the surface temperature, water vapour or aerosols, cloud masking over the polar regions is particularly challenging, as frozen ground, sea ice and snow have spectral properties similar to clouds: the surface is white and bright, and the surface temperature is cold. For these reasons, cloud masking algorithms that work well over land and sea surfaces outside polar regions typically perform poorly within them. As a consequence, in order to reduce cloud contamination biases in surface variable retrievals, existing algorithms typically over-mask in these regions, and hence the number of pixels in an image classified as cloud-free is dramatically reduced compared to what is possible with a better performing algorithm.
There is presently a paucity of ground-based observations in the polar regions, and in the past the situation was even worse. Satellite observations are thus important and can fill the spatial and temporal gaps, which is necessary to monitor the changes in the polar regions. When considered together, the observations may provide insight into exchanges of radiation between the surface, ocean and atmosphere that could have impacts on atmospheric and oceanic circulation. Satellite observations can also be used to evaluate the representation of these effects in climate models.
In this paper we present a cloud identification algorithm developed for the SLSTR (Sea and Land Surface Temperature Radiometer) instrument on board both Sentinel-3A and Sentinel-3B. The algorithm has been developed specifically for the polar regions, encompassing ocean, sea ice, inland sea, land and permafrost.
The cloud identification algorithm uses a Feed Forward Neural Net (NN) that has been trained with collocated data from the CALIOP instrument. The algorithm is presented together with a sensitivity study and the validation results, and it is compared with the existing operational masks for SLSTR. The paper demonstrates how a neural net can be used to improve the performance over land significantly and proposes a new methodology for future cloud masking intercomparison activities.

SLSTR
The Sea and Land Surface Temperature Radiometer (SLSTR) instrument, Coppo et al. (2010), on board the Sentinel-3A and 3B satellite platforms is the latest in a series of dual view visible-infrared passive radiometers launched by ESA to measure sea ice, surface temperature, aerosols and clouds (Merchant et al.). The dual view has been a consistent feature of all the instruments (although for SLSTR it is now a backward view rather than a forward view), and the coverage and the number of channels have improved with each successive instrument. SLSTR now has a 1400 km nadir view and a 740 km oblique view. When both satellites are operating, the revisit time at the equator is 0.8 days. Key channels from 2002 onwards are the 0.55, 0.66, 0.87, 1.6, 3.7, 11 and 12 µm channels. For SLSTR, 1.3 and 2.2 µm are new channels. The 1.3 µm channel is a particularly useful addition for cloud identification, as it lies in a water vapour absorption band and is particularly sensitive to cirrus clouds. The 2.2 µm channel will be useful in this region as, in addition to the 1.6 µm channel, it aids the discrimination between snow and cloud (Schmit et al. (2005)). The instrument specifications are shown in Table 1. The instrument series is designed for high accuracy, with well calibrated measurements and low channel noise. Each instrument benefits from on-board visible and infrared calibration as well as rigorous post-launch vicarious calibration.

Current cloud identification techniques can be broadly classified as empirical, with the mask based primarily (but not necessarily exclusively) on multispectral thresholds. The advantages of this approach are its simplicity, the speed with which it can be implemented, and the ability to switch tests on and off according to the application. Another common approach is to use a Bayesian scheme.
The Bayesian cloud detection scheme calculates a probability of clear sky for a given pixel from the satellite observations, using prior information about the atmosphere and surface conditions and the uncertainties in these variables (Karlsson et al. (2015); Heidinger et al. (2012); Merchant et al. (2005)). The key advantages of this scheme are the probability output, which acts as a quality variable, and the use of a priori information to constrain the result. A third technique, which has recently been applied, is machine learning, as demonstrated by Jeppesen et al. (2019) and Sus et al. (2018). These algorithms are gaining momentum as they have demonstrated good performance; however, they require large training data sets to deliver good results. Sus et al. (2018) developed a Neural Net model for the detection of cloud from the ATSR series of instruments as part of the ESA Cloud cci project. The ATSR cloud identification model was developed using AVHRR data as a proxy, and the algorithm was transferred using coefficients.
The advantage of this approach was that collocations between AVHRR and CALIOP, which are numerous and global, could be used to train the model.

The major disadvantage of this approach was that it could not use the full information content of the ATSR instrument. Even so, the algorithm delivered good results in independent validation analysis described in Bulgin et al. (2018); Poulsen et al. (2019).
The existing operational SLSTR product provides a number of cloud masks, which are briefly described below. These cloud masks will be evaluated in this paper together with the NN mask.

To summarise, the tests use thresholds on the visible and infrared channels as well as some spatial coherence tests. The cloud masking scheme is based on those employed for ATSR-2 and AATSR, described in Závody et al. (2000), with additional tests developed specifically for the 2.2 and 1.3 µm channels. Some of the tests depend on the results of previous tests, hence it is important to consider the order in which they are applied.

Issues with current cloud identification
Current cloud masks for operational satellites are far from perfect for all applications, and a number of key issues, outlined below, remain to be addressed.

• Conservative cloud masking: To avoid biases in surface temperature and aerosol retrievals, existing retrieval schemes usually adopt a conservative approach to cloud screening (Bulgin et al. (2014)). This means that the algorithm uses a cloud mask with a small false positive rate, which has both advantages and disadvantages. The advantage is that the retrievals are rarely biased locally by unidentified clouds, which would generally make a surface temperature retrieval appear colder than it actually is or an aerosol optical depth higher. The disadvantage is that the global coverage of retrievals is significantly reduced, and in areas of persistent or difficult to identify cloud coverage there may be few, if any, surface or atmospheric measurements. In Holzer-Popp et al. (2013), cloud masking was identified as one of the key reasons for the differences in satellite aerosol retrieval performance.
• One size fits all: Cloud masks, particularly operational ones, are often designed with a single use case in mind, with the underlying assumption that a single cloud mask is suitable for all users. Often only a binary cloud mask is provided. In some cases, thick aerosol plumes are masked out because they may impact the surface retrieval; however, this will bias a global aerosol retrieval.

• Uncertainty definitions can be confusing: While uncertainty measures are proving an invaluable addition to satellite retrieval products such as surface temperature, how such an uncertainty applies to cloud identification is often unclear. For example, is the uncertainty a measure of the likelihood of cloud, or is it a measure of the impact on the geophysical property the user is trying to retrieve? A case in point is that a warm, optically thin, low cloud could have a high uncertainty because it is difficult to identify, but have little impact on the retrieved surface variable.

Neural net based algorithm
The use of a NN for cloud identification is motivated by the fact that each of a large set of inputs provides a weak identifier of cloud, and these can be combined. The algorithm developed here is strictly based on the information in a given pixel and does not make use of any information from neighbouring pixels or the complete SLSTR image. The overall approach to developing the algorithm can be summarised as:

• Create a dataset that has truth labels attached to the SLSTR pixels. This truth label tells the algorithm, during training and validation, whether the pixel in question is cloudy or not.
• Split the dataset into two parts: one used for training the algorithm and one used for validation, providing feedback on how long the training should continue.
• Train the NN, by adjusting its internal parameters, to predict the truth labels of the validation sample in the best possible way.
• Save the NN configuration such that it can be used on independent data where no truth labels are present.
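The workflow above can be sketched as follows. This is a minimal illustration with a toy logistic model and synthetic data; the feature count, learning rate and stopping rule are assumptions for the sketch, not the configuration used for SLSTR.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the truth-labelled dataset: each row is a pixel's
# features, each label is the CALIOP truth (1 = cloudy, 0 = clear).
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

# Step 2: split the dataset into a training half and a validation half.
X_tr, X_val = X[:500], X[500:]
y_tr, y_val = y[:500], y[500:]

def predict(w, b, X):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid output in (0, 1)

# Step 3: train by gradient descent; the validation loss decides how long
# training continues (a crude early-stopping rule).
w, b = np.zeros(X.shape[1]), 0.0
best_val = np.inf
for epoch in range(500):
    p = predict(w, b, X_tr)
    grad_w = X_tr.T @ (p - y_tr) / len(y_tr)
    grad_b = np.mean(p - y_tr)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b
    val_loss = np.mean((predict(w, b, X_val) - y_val) ** 2)
    if val_loss > best_val:  # validation loss stopped improving
        break
    best_val = val_loss

# Step 4: keep the trained configuration for later use on unlabelled data.
saved = {"w": w, "b": b}
```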

Review of datasets for training and evaluation
Any machine learning algorithm that depends on supervised learning for its configuration will rely critically on the creation of the dataset that it is trained on.
In the past, ATSR and SLSTR cloud masks have been trained and evaluated using hand classified scenes (Bulgin et al. (2014)). While each hand classified scene has many thousands of pixels, and is in general useful for training and evaluation, there are a number of disadvantages to this approach:

• Hand classified images are expensive to produce, and as a consequence very few are made publicly available. The delay between creation and release can be very long, stifling innovation.
• Because there are relatively few hand classified images, they cover only a few select regions, and the cloud types within the images will be highly correlated, significantly reducing their global representativeness.
• The hand classification can be quite subjective for difficult to classify clouds, particularly cloud edges, clouds of small optical depth, and clouds over bright surfaces such as desert or ice.
• There is little extra information in the mask to enable a more insightful cloud mask to be produced, e.g. cloud type, height or optical thickness.

An alternative approach has been to use instruments such as MODIS or AVHRR as a proxy, as in Hollstein et al. (2015) and Sus et al. (2018). Such an approach does not take into account the different spectral shapes of the individual channels, the associated noise, calibration and instrument geometry, or the additional channels available on SLSTR.

In other cases, the mask is developed using surface synoptic observations (SYNOP) (Istomina et al. (2010)). However, these observations, while plentiful in time, are highly subjective and only located over land.
In this paper we use collocated CALIOP measurements to train and evaluate the cloud identification scheme. The CALIOP horizontal resolution is 333 m and the vertical resolution is 30-60 m. In this analysis, we use the CAL LID L2 V4-20 1 km and 5 km products. As statistical noise is averaged out, the sensitivity to cirrus cloud is higher for the 5 km product than for the 1 km product; however, the 1 km product was used to train the NN, as that resolution matches the SLSTR product and thus minimises mis-collocation with small broken clouds. The 5 km product was used for additional evaluation as a function of optical depth, as the optical depth measurement is only present in the 5 km product.

This collocated dataset is created by identifying a set of pixels in SLSTR images where there is simultaneous information from CALIOP on the same area.
For the creation of the collocated data, the time difference between the SLSTR image and the CALIOP cross-over is required to be less than 20 minutes. A spatial match is identified from the centres of the 500 m × 500 m SLSTR pixels and the CALIOP footprints. The Antarctic data lies between 61.8°S and 78.1°S. A common problem with datasets used for training neural nets is that each entry in the dataset is highly correlated with other entries; if that is the case, the dataset effectively behaves as a smaller dataset. That is exactly what would happen if a low number of SLSTR images with all pixels matched were used for training: they cover only small periods in time and have many pixels with the same clouds and surface types. As the matched dataset in our approach covers only a very thin strip across each SLSTR image, the overall dataset will have only small correlations between the pixels. Thus the dataset is more powerful to train on than a similar number of matched pixels from a low number of SLSTR images. The disadvantage is that the collocations are confined to a narrow strip of each image.

In the SLSTR data, 22 inputs were identified to be used as features for the neural net. These consist of 9 spectral channels, latitude, longitude, satellite zenith angle, solar zenith angle, surface type flags and some ancillary information. The surface types Coastline, Ocean, Tidal, Dry Land and Inland Water were provided as individual binary input channels rather than as a single bitmask.
The Dry Land mask is obtained by taking the pixels labelled as Land and subtracting those labelled as Inland Water. The ancillary information Cosmetic, Duplicate, Day and Twilight was provided as binary input channels as well.
The first two can inform the algorithm about how much the information can be trusted, while the latter two, although fully correlated with the solar zenith angle, allow for categorisation afterwards. The Snow flag was not used, as it contains information derived from other cloud identification algorithms and was in addition observed to be of low quality. Sun glint information was not used, as sun glint is never an issue in polar regions.

Neural net
The NN chosen was a simple feed forward net, with the implementation based on TensorFlow (Abadi et al. (2015)). The geometry of the net was determined using trial and error.
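The implementation in this work is based on TensorFlow; as a self-contained illustration of the feed forward structure, the forward pass of such a net can also be written directly. The two hidden layers and their sizes below are assumptions for the sketch, not the geometry actually chosen.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative geometry: 22 inputs -> 32 -> 16 -> 1. The real layer sizes
# were chosen by trial and error and are not reproduced here.
W1, b1 = rng.normal(scale=0.1, size=(22, 32)), np.zeros(32)
W2, b2 = rng.normal(scale=0.1, size=(32, 16)), np.zeros(16)
W3, b3 = rng.normal(scale=0.1, size=(16, 1)), np.zeros(1)

def forward(x):
    h1 = relu(x @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    return sigmoid(h2 @ W3 + b3)  # output in (0, 1): 0 = clear, 1 = cloudy

x = rng.normal(size=(5, 22))  # feature vectors for five pixels
out = forward(x)
```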

Performance
This section begins with specific qualitative examples of the NN labelling scenes from SLSTR in comparison to the Bayesian and Empirical masks. We then define the specific metrics used for a detailed quantitative comparison, which in the subsequent subsections is evaluated according to surface type, cloud type and optical depth.
For each pixel in an SLSTR image, the trained NN provides an output between 0 and 1. An output close to 0 means that the algorithm is confident that the pixel is clear, while an output close to 1 means that the algorithm is confident that the pixel is cloudy. This information is usually used by defining a threshold: everything with an output below the threshold is classified as clear and everything above it as cloudy. The choice of threshold depends on the use case. A very low threshold will identify clear areas with very few cloudy pixels wrongly classified, while a high threshold will identify cloudy areas with a very low contamination of clear pixels. The optimal threshold to use might depend on the surface type.
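The thresholding step can be sketched as follows; the output values and thresholds are illustrative.

```python
import numpy as np

# Illustrative NN outputs for five pixels (0 = clear, 1 = cloudy).
nn_output = np.array([0.02, 0.15, 0.48, 0.51, 0.97])

def classify(output, threshold):
    # Below the threshold -> clear (0); at or above -> cloudy (1).
    return (output >= threshold).astype(int)

# A low threshold keeps only very confident clear pixels as clear.
conservative_clear = classify(nn_output, 0.1)
# A middle threshold balances the two classes.
balanced = classify(nn_output, 0.5)
```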

Classification examples
For visual comparisons to other algorithms, we will focus on the ability to label areas as clear. Such a comparison is shown in Fig. 2.

Metrics
For a given algorithm, with a specific choice of threshold on the classifier where relevant, the True Positive Rate (TPR) and the False Positive Rate (FPR) for a sample of truth labelled data can be defined as

TPR = (# pixels correctly identified as cloudy) / (# cloudy pixels)    (1)

FPR = (# pixels wrongly identified as cloudy) / (# non-cloudy pixels)    (2)

The perfect algorithm would have TPR = 1 and FPR = 0. For an algorithm with an adjustable threshold, it is possible to draw a curve, called the Receiver Operating Characteristic (ROC) curve, of TPR as a function of FPR, as seen in Fig. 4. A threshold of zero will correspond to everything being labelled as cloudy, and thus have (FPR, TPR) = (1, 1), while a threshold of 1 will label everything as clear and have (FPR, TPR) = (0, 0). A random classifier would lie on the diagonal where TPR = FPR.

To provide a single number that characterises the performance of an algorithm, the area under the ROC curve is a commonly used measure. For a random algorithm the area will be 0.5, while for the perfect algorithm it will be 1. When producing the ROC curves for the neural net, it is not possible to plot a corresponding curve for the existing cloud masks, as their outputs are binary and there is no threshold to vary. In Fig. 4, the existing algorithms are thus shown as points and not curves. When a point lies below the ROC curve, it indicates poorer performance.
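A sketch of computing the TPR/FPR pairs of Eqs. (1) and (2) over a sweep of thresholds, and the area under the resulting ROC curve; the truth labels and scores are toy values chosen to separate perfectly.

```python
import numpy as np

def roc_points(truth, score, thresholds):
    """TPR and FPR (Eqs. 1 and 2) at each threshold; truth: 1 = cloudy."""
    tpr, fpr = [], []
    for t in thresholds:
        pred = score >= t
        tpr.append(np.sum(pred & (truth == 1)) / np.sum(truth == 1))
        fpr.append(np.sum(pred & (truth == 0)) / np.sum(truth == 0))
    return np.array(fpr), np.array(tpr)

# Toy truth labels and classifier scores that separate the classes perfectly.
truth = np.array([0, 0, 0, 1, 1, 1])
score = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])

# Sweep the threshold from 1 (everything clear) down to 0 (everything cloudy).
thresholds = np.linspace(1.0, 0.0, 101)
fpr, tpr = roc_points(truth, score, thresholds)

# Area under the ROC curve via the trapezoidal rule; 1.0 for a perfect classifier.
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
```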

Surface type
The performances of the NN and the existing Bayesian and Empirical masks are evaluated on data that was not used for the training or validation of the NN. To provide truth level information, this data was still taken from the sample collocated with CALIOP; the results are summarised in Table 2. Over inland water, however, the Bayesian mask has (FPR, TPR) values of (0.66, 0.82), which means that any sample identified as clear by the algorithm will have a high level of contamination by pixels that are actually cloudy (1 − TPR = 18% of cloudy pixels over inland water will be identified as clear). The Empirical algorithm suffers from the same performance issues as the Bayesian algorithm, but in an even more pronounced way. The NN does not suffer from this problem, as can be seen from the fact that the AUC stays essentially the same for all surface types.

Cloud types
In a similar way to the analysis by surface type, the performance can be investigated as a function of cloud type. In this case the TPR is evaluated only on the pixels that are classified by CALIOP as a given cloud type, while the FPR is evaluated, as before, on all pixels that are clear according to the truth label. The ROC curves are evaluated for each cloud type.

Optical depth
The 5 km product from CALIOP provides a measurement of the optical depth of thin clouds. With the NN algorithm trained, it is possible to compare the output from the NN, as defined in Sec. 3, for collocated pixels in bins of varying optical depth, as seen in Fig. 7. As the 5 km product is used, each NN output is mapped to the nearest available optical depth measurement, which may be up to 2.5 km away. For each bin, a distribution of NN outputs is created and the median is plotted; asymmetric error bars cover 34% of the distribution in each direction from the median. The figure demonstrates that the NN output is correlated with the optical depth in the region between 0 and 2. At higher values of the optical depth, the algorithm saturates. It can also be seen that the algorithm does not provide much separation between clear scenes and clouds of optical depth up to 0.5 (corresponding to the first two bins).

The visible channels are not critical to the performance, while the near infrared channels are. The loss in performance from planing a single variable is larger over land than over ocean. While the S5 and S6 channels are both water absorption channels, it can be seen that they add a large amount of different information to the cloud masking (if they were redundant, the performance of the NN would be unchanged by planing against one of them).
They behave slightly differently according to the optical depth and effective radius of the cloud (Wang et al. (2018)), and the channels also exhibit different spectral behaviour with respect to vegetation. The complex nature of these relationships illustrates how a NN approach can provide added value through the identification of the multi-dimensional relationships between different channels, cloud types and the surface.
In a similar way, we examined the performance of the neural net when the training was made blind to the surface type and the longitude/latitude information. Within the statistical noise, there was no impact on performance from removing this information. For longitude and latitude, this is to be expected given the similarity of the polar regions across the range they were trained on. For the surface type, this is more surprising, but it is consistent with the fact that the algorithm has almost identical performance over land and sea.
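The idea behind these tests — removing the information in one input and measuring the loss of skill — can be illustrated with a toy example. The data, the scoring rule and the rank-based AUC here are assumptions for the sketch, not the procedure applied to SLSTR.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: the truth label depends only on feature 0; feature 1 is pure noise.
X = rng.normal(size=(2000, 2))
y = (X[:, 0] > 0).astype(int)

def auc(truth, score):
    # Rank-based AUC: probability that a cloudy pixel outscores a clear one,
    # counting exact ties as one half.
    pos, neg = score[truth == 1], score[truth == 0]
    greater = pos[:, None] > neg[None, :]
    ties = pos[:, None] == neg[None, :]
    return np.mean(greater + 0.5 * ties)

# Baseline skill when scoring with the informative feature.
baseline = auc(y, X[:, 0])

# "Plane" feature 0: replace it with its mean so it carries no information.
planed_score = np.full(len(y), X[:, 0].mean())
planed = auc(y, planed_score)  # 0.5: all ranking power is lost
```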

The fact that surface type information is not important as an input to the algorithm should not be taken to mean that the algorithm will perform well for surface types that it has not been trained on, as will be seen through an example in the next section.

Performance of the NN cloud identification scheme was analysed as a function of surface type and cloud type. This breakdown of cloud mask performance enables a more comprehensive assessment of cloud identification schemes and allows future improvements to be focused on the areas where current algorithms are underperforming.

The performance of the NN algorithm was slightly better over Antarctica than over the Arctic. The reason for this slight difference is not clear but could be because there are fewer fixed vegetation/snow sites in this polar region.
The performance of this algorithm is significantly better than that described above for the existing operational masks. The NN algorithm can be compared in a qualitative way to the algorithm presented in Jafariserajehlou et al. (2019). We do this by classifying the scene presented in Fig. 13 of that paper and showing it next to our classification in Fig. 9. It can be seen that in this example the two algorithms provide very similar results. A quantitative comparison would require a project in which both algorithms were compared using identical scenes with identical truth mapping.
In summary, the algorithm is significantly superior to the existing operational cloud masks in the polar regions, where a direct quantitative comparison can be made.
While the algorithm was developed with training images over the polar regions, it is still possible to use it elsewhere. While there is no truth labelled dataset against which to check the performance, a visual inspection reveals very good performance over ocean but also some very poor performance over land.

An example of this can be seen in Fig. 10, from a scene over northern Australia. It can be seen that the algorithm fails over the Australian red soil. This is not surprising, as this type of surface is very different from anything the algorithm was trained on. In order to create a similar algorithm outside polar regions, where timely collocations over different surfaces (with a good representation of their different characteristics, such as reflectance and temperature) are possible, truth data sets will need to be defined. NASA's Cloud-Aerosol Transport System (CATS) lidar (Yorks et al. (2016)), which operated between January 2015 and October 2017, has a small overlap period with SLSTR, but the data was not available at the time of analysis. Hand classified data sets will be useful for this purpose; however, these are generally time consuming and expensive to generate and will be subject to human biases and internal correlations.
The algorithm has not used the SLSTR backward view, in order to keep the developed mask consistent and applicable across the full swath. However, where the dual view swath is available, the information in the backward view could be exploited to improve the mask. The algorithm could be developed further by using neighbouring pixels and by adding auxiliary data sets.

Conclusion
In this paper we have described the development of a machine learning algorithm to identify clear and cloudy pixels in SLSTR images. The algorithm was trained on collocated data from the CALIOP instrument. Collocations with good temporal matches between CALIOP and SLSTR are only possible in the polar regions, which means that the power of this method could only be demonstrated there. We demonstrate significantly improved performance compared to existing operational algorithms, particularly over land. The algorithm performed equally well over land and sea. Opaque clouds were identified with greater skill than thin or broken clouds. The sensitivity to optical depth was assessed using the CALIOP 5 km product, which demonstrated that the ability to detect clouds drops rapidly below an optical depth of 0.5.
For the NN, the same performance in terms of contamination of cloudy pixels in the sample of identified clear pixels can be achieved while retaining 40% of the clear pixels, compared with 10% for the operational cloud identification. The algorithm is fast, taking just a few seconds per scene, so it can be run operationally.
There is potential to improve the algorithm through the use of the oblique view and the addition of other auxiliary data sets. Improvements could also be made by using the texture of the image in a small region around the pixel being classified.