Simultaneous classification and location of deformation in SAR interferograms using deep learning

This manuscript is a non-peer reviewed EarthArXiv preprint that has been submitted for publication in Remote Sensing of Environment. With the evolution of InSAR into a tool for active hazard monitoring, through its ability to detect ground deformation with low latency, new methods are sought to quickly and automatically interpret the large number of interferograms that are created. In this work, we present a convolutional neural network (CNN) that is able to both classify the type of deformation and locate the deformation within an interferogram in a single step. We achieve this through building a "two headed model", which is able to return both outputs after one forward pass of an interferogram through the network, and so does not require the use of a sliding window approach for localisation. We train our model by first creating a large dataset of synthetic interferograms which feature labels of both the type and location of any deformation, but also find that our model's performance is improved through the inclusion of just a small amount of real data. When building models of this type, it is common for some of the weights within the model to be transferred from other models designed for different problems. Consequently, we also investigate how best to organise interferograms such that the filters learned in models such as VGG16 are sensitive to the signals of interest in interferograms, but find that using different data in each of the three input channels significantly degrades performance when compared to the simple case of repeating (un)wrapped phase across each channel. This implies that the inclusion of supplementary data, which we expect should improve the ability to distinguish deformation from noise, requires training of a network from scratch.

) or, in more complex algorithms, indicating the exact outline of an object by identifying which pixels comprise it (e.g. He et al. (2017)). These approaches should provide more detailed information on the spatial extent of a signal of interest than a classification model that is repeatedly used on different areas of an image. Consequently, we endeavour to develop an algorithm that is able to both classify types of deformation and localise it within an interferogram in one step. Figure 1 shows our initial division of deformation patterns into different classes, and can be considered similar to the WordNet hierarchy (Fellbaum, 1998) that underpins ImageNet (Deng et al., 2009).
When seeking to build a CNN to perform a classification or localisation problem, common approaches can be divided into one of three broad categories depending on the utilisation of pre-existing models. In the most fundamental case, a new model is designed before all the parameters within it are trained (e.g. Rauter and Winkler (2018)), but this approach has the risk of failing to utilise the successful applications of CNNs to other problems.

We propose a model that is able to classify interferograms as either containing only atmospheric signals, or as containing deformation due to inflating sills or opening dykes. As our proposed model will work with only data from one look angle, we envisage that deformation due to processes that could be modelled as a point pressure source (commonly referred to as a "Mogi" source (Mogi, 1958)) is likely to be incorporated in the inflating sill label. We do not present this hierarchy as complete, and envisage that future studies may add further subtrees, such as signals due to the cooling and contraction of emplaced lava flows.
channel. However, when considering an image of interferometric phase, these images contain only a single value for each pixel, and so consist of only one channel, and are analogous to a greyscale image. This difference in the number of channels can be circumvented through duplicating the one channel interferogram in each of the three input channels of a CNN, but in this section of our study we wish to determine if this approach can be improved upon.

When two SAR images are combined to form a single interferogram, the resulting image is a 2D array of complex numbers. Whilst the magnitude of each of these complex numbers relates to the underlying brightness and coherence of a given pixel, it is common for only the argument to be displayed, as these phase values can be used to infer ground movement. However, the phase values of an interferogram are wrapped in the range [−π, π] as only the fractional part of the phase value can be measured, but this ambiguity can be estimated to produce an unwrapped interferogram (Chen and Zebker, 2001). We postulate that in addition to the use of either wrapped or unwrapped data duplicated to fill three channels, the original complex numbers of an interferogram could be used in two channels, and so allow the network to use this information directly.

To perform this analysis, we first synthesise a dataset of labelled interferograms. The collection of enough labelled data to train a CNN is commonly time consuming or expensive, and we find that the addition of localisation labels to our data makes it more time consuming than in previous studies.

Additionally, due to the large number of data that are required to train CNNs and our expansion to classification of different types of deformation, procuring enough real data to do this may not be possible. Consequently, we perform this analysis using only synthetic data. Following the hierarchy proposed in Figure 1, we create interferograms that contain either no deformation, deformation due to an opening dyke, or deformation due to an inflating sill. We model the dykes and sills as approximately vertical and horizontal dislocations, respectively, with uniform opening in an elastic half space (Okada, 1985). For the set of sills, we randomly select strikes (González et al., 2016). Figure 2 shows the results of mixing these different elements to create our synthetic interferograms.

This process creates unwrapped data, which can be converted to wrapped data through finding modulo 2π of the unwrapped phase. However, to synthesise both the real and imaginary part of a complex interferogram requires knowledge of both the brightness of a pixel and its phase. To achieve this, we again use the SRTM DEM, and calculate the intensity of reflected electromagnetic radiation at the angles of incidence used by the Sentinel-1 satellites (29.1–46.0°), before adding speckle noise, and calculating the interferometric amplitude between two images (i.e. the product of the two amplitudes).
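The rewrapping step can be sketched in a few lines of NumPy (a minimal illustration with a synthetic stand-in array; only the modulo arithmetic is taken from the text):

```python
import numpy as np

# Stand-in for an unwrapped interferogram (radians); in practice this is
# produced by the synthesis pipeline described above.
unwrapped = np.random.default_rng(0).normal(scale=5.0, size=(224, 224))

# Rewrap into [-pi, pi) by taking modulo 2*pi of the shifted phase.
wrapped = np.mod(unwrapped + np.pi, 2 * np.pi) - np.pi
```

Shifting by π before the modulo centres the result on zero, matching the [−π, π] convention used for wrapped phase.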

As inputs to CNNs that are to be trained using transfer learning must be rescaled to the inputs used in the original training, we use only relative values in the range [−1, 1] for the synthetic intensities. With knowledge of the modulus (relative intensity) and argument (wrapped phase) of each pixel of our synthetic interferogram, the real/imaginary components are simply the products of the modulus and cosine/sine of the argument, respectively. Figure 3 shows five different ways we can represent an interferogram using the three channels available.
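Under these definitions, the five channel arrangements can be assembled as below (a hedged sketch using random stand-in arrays; the format labels follow the shorthand used in Figure 4):

```python
import numpy as np

rng = np.random.default_rng(1)
shape = (224, 224)
unwrapped = rng.normal(scale=5.0, size=shape)            # unwrapped phase (radians)
wrapped = np.mod(unwrapped + np.pi, 2 * np.pi) - np.pi   # argument, in [-pi, pi)
intensity = rng.uniform(-1.0, 1.0, size=shape)           # relative intensity (modulus)
dem = rng.uniform(0.0, 1.0, size=shape)                  # rescaled DEM

# Real/imaginary components: products of the modulus with the
# cosine/sine of the argument, respectively.
real = intensity * np.cos(wrapped)
imag = intensity * np.sin(wrapped)

# The five three-channel representations shown in Figure 3.
formats = {
    "uuu": np.stack([unwrapped, unwrapped, unwrapped], axis=-1),
    "uud": np.stack([unwrapped, unwrapped, dem], axis=-1),
    "rid": np.stack([real, imag, dem], axis=-1),
    "www": np.stack([wrapped, wrapped, wrapped], axis=-1),
    "wwd": np.stack([wrapped, wrapped, dem], axis=-1),
}
```

Each array then has the (224 × 224 × 3) shape expected by VGG16-style networks.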

The CNN we build to classify the synthetic interferograms uses the five convolutional blocks of VGG16 (Simonyan and Zisserman, 2014), with our own fully connected network after this. This network was chosen as, when used in the field of computer vision for classifying natural images, it outperformed older models such as AlexNet (Simonyan and Zisserman, 2014), which is used in the algorithm presented in Anantrasirichai et al. (2018). When an interferogram of shape (224 × 224 × 3) is passed through the convolutional layers of VGG16, it is transformed into a tensor of shape (7 × 7 × 512). This is then flattened to make a vector of size 25,088, before being passed through fully connected layers of size 256 and 128, and an output layer of size three (i.e., dyke, sill, or no deformation). To produce a set of outputs that can be used as probabilities, we use a softmax activation for the last layer (Bridle, 1990), but on the remaining layers we use rectified linear units (ReLUs) to reduce computation time (Agostinelli et al., 2014). As our model seeks to solve a classification problem, we use categorical cross entropy for the loss function, which we seek to reduce using the Nadam optimizer as this does not require the choice of a learning rate (Dozat, 2016).
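A minimal Keras sketch of this architecture follows (hedged: the exact training configuration is not fully specified in this excerpt, and the ImageNet weights are replaced with random initialisation to keep the example self-contained and offline):

```python
from tensorflow import keras

# Five convolutional blocks of VGG16; the paper transfers ImageNet weights
# (weights="imagenet"), set to None here to avoid a download.
vgg = keras.applications.VGG16(include_top=False, weights=None,
                               input_shape=(224, 224, 3))

x = keras.layers.Flatten()(vgg.output)                 # (7, 7, 512) -> 25,088
x = keras.layers.Dense(256, activation="relu")(x)
x = keras.layers.Dense(128, activation="relu")(x)
out = keras.layers.Dense(3, activation="softmax")(x)   # dyke / sill / no deformation

model = keras.Model(vgg.input, out)
model.compile(optimizer=keras.optimizers.Nadam(),
              loss="categorical_crossentropy", metrics=["accuracy"])
```

Freezing the convolutional blocks (`vgg.trainable = False`) would preserve the transferred filters during initial training of the fully connected layers.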

A common problem of CNNs that are used for classification can be overfitting of the training data, which results in a model that generalises poorly to new data. We endeavour to limit this through the use of dropout (Srivastava et al., 2014).

In the previous section, we demonstrated that, when using VGG16 with convolutional weights learned on ImageNet data, roughly optimal performance for classifying synthetic interferograms is achieved when either the wrapped or unwrapped phase is repeated across the three input channels. We choose to progress with only the unwrapped phase model, as the computational cost of unwrapping is often already met by automatic processing systems (e.g. LiCSAR, González et al. (2016)), and the development of models that use unwrapped phase may lead to benefits such as the ability to classify and locate unwrapping errors. In this section, we build on the model used to perform classification by adding localisation output. We also endeavour to ascertain if the expense of collecting labelled data can be avoided entirely through the continued use of synthetic data when training our model.

We achieve both classification and localisation through dividing the fully connected section of our model to produce two distinct outputs. One output returns the class of the input data in the manner described in Section 2, whilst the second returns the location of any deformation within the scene.

Figure 2: An example of the constituent parts of seven synthetic interferograms. A third of these do not feature deformation (e.g. interferogram 5), a third feature deformation due to an inflating sill (e.g. 4), and a third feature deformation due to an opening dyke (e.g. 2). These signals are geocoded and areas of water masked, before being combined with a topographically correlated APS and a turbulent APS. Areas of incoherence are also synthesised, and these are used to mask the combination of the three signals to create the final synthetic interferograms.

Figure 3: Organisation of an interferogram into three channel form. Columns one and two feature unwrapped data that is repeated, and in column two the DEM is included as the third channel. In column three the real and imaginary elements of the complex values of each pixel of an interferogram occupy channels one and two, whilst the DEM is included in the third. Columns four and five feature wrapped data that is repeated, and in column five the DEM is included as the third channel.

Figure 4: Accuracy of classifying validation data (10% of the total) during training using three channel data arranged in different formats. "u": unwrapped data, "w": wrapped data, "d": DEM, "r": real component of interferogram, "i": imaginary component of interferogram. Low accuracy is seen for the "rid" data, and in both the wrapped and unwrapped cases inclusion of the DEM in the third channel is seen to degrade classification accuracy. At the end of the 20 epochs of training, only a small difference is seen in accuracy between wrapped and unwrapped data, with both classifying ∼95% of the validation data correctly, though the wrapped phase model is seen to achieve this level of accuracy more quickly (requiring only eight epochs of training).
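The division of the fully connected section into two outputs can be sketched with the Keras functional API (an illustrative sketch: the localisation head's layer sizes and its four-value bounding-box output are assumptions, as this excerpt does not specify them):

```python
from tensorflow import keras

# Shared VGG16 convolutional blocks (ImageNet weights replaced with random
# initialisation here to keep the example self-contained).
vgg = keras.applications.VGG16(include_top=False, weights=None,
                               input_shape=(224, 224, 3))
features = keras.layers.Flatten()(vgg.output)

# Classification head: the three classes described in Section 2.
c = keras.layers.Dense(256, activation="relu")(features)
c = keras.layers.Dense(128, activation="relu")(c)
class_out = keras.layers.Dense(3, activation="softmax", name="class")(c)

# Localisation head (assumed shape): regression to a bounding box.
l = keras.layers.Dense(256, activation="relu")(features)
l = keras.layers.Dense(128, activation="relu")(l)
loc_out = keras.layers.Dense(4, activation="linear", name="loc")(l)

# One forward pass yields both outputs; no sliding window is required.
model = keras.Model(vgg.input, [class_out, loc_out])
model.compile(optimizer=keras.optimizers.Nadam(),
              loss={"class": "categorical_crossentropy", "loc": "mse"})
```

Passing a dictionary of per-output losses lets Keras train both heads jointly from a single forward pass.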

In machine learning parlance, models of this type are termed double headed, and we subsequently refer to either of the outputs and their corresponding preceding layers as either the classification head or the localisation head.

When fine-tuning, we find that a learning rate of 1.5 × 10⁻⁸ does not degrade the performance already gained during training, but still allows for the validation localisation loss to decrease from ∼800 to ∼700 pixels² (i.e. a mean error of ∼2.6 km), and the classification accuracy to increase from ∼0.8 to ∼0.85.

Figure 7 shows the results of applying our trained classification and localisation model to a random selection of the testing data (i.e., data that the model was not exposed to during training). In the majority of cases, the classification can be seen to be accurate, and the localisation approximately correct. When considering the entire test set of data, the classification accuracy is 0.89, whilst the localisation loss is ∼700 pixels². It should be noted that we could also report the classification loss (0.31), but we believe this is less useful than the accuracy. However, in the localisation case, accuracy is not a meaningful measure of the fidelity of the output, as it is instead a regression problem in which we aim to approximate the correct values, which are continuous in nature. In a manner similar to that reported for the validation data, the localisation loss (mean squared error) of ∼700 pixels² corresponds to a mean error of ∼2.6 km (when using three arc second pixels).

Figure 7: Deformation units are centimetres; black class labels and location boxes were generated from the synthetic data and span areas with over 5 cm of deformation, whilst red depicts those predicted by the CNN. As the model outputs a probability for each label, these are included as decimals for each of the predicted classes.
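The conversion from localisation loss to a ground distance can be checked with quick arithmetic (a rough check: three arc seconds is taken as roughly 90 m of ground distance, though the exact pixel size varies with latitude):

```python
import math

mse = 700.0       # localisation loss (mean squared error, in pixels^2)
pixel_m = 93.0    # approximate ground size of a three arc second pixel

rmse_pixels = math.sqrt(mse)                    # root mean squared error in pixels
mean_error_km = rmse_pixels * pixel_m / 1000.0  # ~2.5 km, close to the ~2.6 km quoted
```

The small difference from the quoted ∼2.6 km is consistent with rounding and the latitude dependence of the pixel size.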
Inspection of the results shows that in all but one of the randomly chosen cases, the localisation is broadly correct, and the classification is correct. Interferogram 2, which is classified incorrectly, features a relatively strong turbulent APS (seen as the spatially correlated noise) and a deformation pattern that extends into an area of incoherence, which may explain the misclassification.
in some examples assigning a single class to a complex deformation pattern is difficult, and we instead assign what we deem the dominant class to be,

whilst expecting that the network should assign some probability to other classes. This is most evident in interferograms seven, nine and ten of Figure 7.

For both cases, the models can be seen to achieve good accuracy when classifying interferograms that contain either no deformation or deformation due to the inflation of a sill, but to misclassify interferograms that contain deformation due to an opening dyke (accuracies of 0.00 and 0.67). A significant reduction in localisation loss is also seen for interferograms that contain no deformation (1423 to 531 pixels²), suggesting that the inclusion of real data improves the model's ability to refrain from interpreting atmospheric signals as the location of deformation.
synthetic data is able to classify and locate deformation signals in Sentinel-1 data. However, it is only successful in cases with particularly clear deformation.

of the set of "no deformation" data, or combined with synthetic deformation patterns to produce more complex semi-synthetic data.

The results presented in Figure 9 show the benefit of incorporating real data. However, much scope for improvement remains, with several classifications remaining incorrect. We expect a threshold in the signal to noise ratio to exist below which a method is not able to identify the signal of interest, and these interferograms appear to represent that. In the latter case, the interferograms in question contain complex deformation patterns due to both the opening of a dyke and the removal of magma from a sill below the caldera (Novellis et al., 2017), and the inclusion of either real or synthetic training data that contains multiple deformation patterns may alleviate this shortcoming.

The divergent nature of the two heads (classification and localisation) of our network also allows for discrepancies between their outputs. This is seen in interferogram 10 of Figure 9, in which the localisation head produces a broadly correct output, but the signal is incorrectly labelled as "no deformation".

At present, an input with side length 224 is reduced to a feature map with side length 7 (shown in Figure 5) which, combined with a depth of 512,

finds that 8% of signals are located more than 10 km from a volcanic edifice, and would therefore be missed by our current model. Future models that wish to perform localisation using a global approach may therefore require slight increases in size in order to capture all signals of interest.

We find that either wrapped or unwrapped data are approximately equally suited for use with the weights of VGG16's filters trained on ImageNet data, whilst more complex use of the three channel format that these models support degrades performance. However, we expect this will not be the case if the weights within VGG16's filters are trained from scratch, as additional data such as topography should help to separate deformation from noise.

We combine the five convolutional blocks of VGG16 with two fully connected networks to perform both classification and localisation, which allows our network to reason using the whole interferogram (i.e. avoiding a sliding window approach), and therefore move a step closer to interpreting InSAR data in a manner similar to a human expert. Additionally, our network is able to differentiate between several different forms of deformation.

To minimise the costly nature of labelling data, we initially train our model using only synthetic data. We find that our model generalises well to some cases of Sentinel-1 data, but errors remain in cases such as sub-

(Hunter, 2007), and all CNN work was carried out in Keras using the TensorFlow backend.