A comparative study of transfer learning methodologies and causality for seismic inversion with temporal convolutional networks

Data and Code: https://github.com/olivesgatech/Image21-Comparative-transfer-learning-seismic-inversion-causality

Bib:
@inproceedings{mustafa2021comparative,
  title={A comparative study of transfer learning methodologies and causality for seismic inversion with temporal convolutional networks},
  author={Mustafa, Ahmad and AlRegib, Ghassan},
  booktitle={SEG/AAPG/SEPM First International Meeting for Applied Geoscience & Energy},
  year={2021},
  organization={OnePetro}
}


INTRODUCTION
Building accurate subsurface models plays a critical role in several applications, including hydrocarbon exploration and carbon storage. Reservoir characterization for subsurface model building involves the use of well logs and seismic data to obtain estimates of the underlying physical parameters of the subsurface, in a process called seismic inversion. Data-driven inversion algorithms have recently witnessed tremendous improvements: from 1-D sequence models designed to better model the temporal characteristics of the data (Alfarraj and AlRegib, 2018; Mustafa et al., 2019) to 2-D sequence models that also capture spatial correlations for better lateral continuity (Mustafa et al., 2020), and from semi-supervised machine learning techniques utilizing vast quantities of unlabeled seismic data (Alfarraj and AlRegib, 2019a,b) to transfer learning-based methods incorporating prior knowledge of well logs to improve performance on a given target survey (Mustafa and AlRegib, 2020; Wu et al., 2020; Mustafa et al., 2021). The end goal of all such works is to reduce the labeled data requirement of data-hungry machine learning methods.
Transfer learning, as commonly understood in the context of deep neural networks, involves pretraining a network on a large source dataset that is later finetuned on a usually much smaller target dataset. The fundamental assumption is that similar datasets share low-level representations learnt by the earlier layers and express different representations in the deeper layers. Where this assumption is violated, transfer learning on a target dataset may degrade performance compared to the baseline case with no transfer learning. Augmenting a target dataset with labeled training examples from another dataset may also result in performance improvements on the former in certain cases, again subject to the condition that the two datasets were sampled from similar distributions. In contrast, Mustafa and AlRegib (2020) introduced a novel knowledge sharing mechanism involving joint training on source and target datasets by means of a soft weight sharing penalty on the two networks. They showed performance improvements on the target dataset while avoiding negative inductive transfer from a significantly different source dataset. In the context of data-driven seismic inversion attempted via deep learning, we have yet to see a comprehensive comparative study of the different transfer learning methodologies under controlled conditions. This is one of the gaps that this work aims to address.
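As a concrete sketch of the pretrain-and-finetune recipe just described, consider the following generic PyTorch illustration. The toy network, toy regression data, epoch counts, and learning rates here are all hypothetical placeholders, not the setup used later in this paper:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()

# Hypothetical large source dataset and much smaller target dataset.
x_src, y_src = torch.randn(200, 8), torch.randn(200, 1)
x_tgt, y_tgt = torch.randn(10, 8), torch.randn(10, 1)

# 1) Pretrain on the source dataset.
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss_fn(net(x_src), y_src).backward()
    opt.step()

# 2) Finetune the same weights on the target dataset,
#    typically with a smaller learning rate.
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
for _ in range(50):
    opt.zero_grad()
    tgt_loss = loss_fn(net(x_tgt), y_tgt)
    tgt_loss.backward()
    opt.step()
```

The pretrained weights serve only as an initialization; whether finetuning from them helps or hurts depends on how similar the two data distributions are.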
The work by Mustafa and AlRegib (2020) demonstrated seismic inversion for the case where the seismic traces and impedance pseudologs were of the same vertical resolution for both datasets. Such a scenario is highly improbable in practice, owing to the vast differences in resolution and sampling that exist not only between the seismic data and well logs of a given survey, but also between different surveys. We therefore extend the methodology implemented in Mustafa and AlRegib (2020) to work for any combination of resolution and sampling rates among and between datasets.
While the 2-D temporal convolutional network-based sequence models in Mustafa et al. (2020) resulted in a marked improvement over their 1-D counterparts, they are still restricted to modeling the data in a causal manner, i.e., any depth sample on a well log may be estimated using only seismic trace samples at previous depths. This not only limits the modeling capability of the network, but is also contrary to the physics of how model-based seismic inversion is attempted. Our third contribution in this work is to modify the 2-D temporal convolutional network (TCN) to perform non-causal convolutions and to compare the resulting performance to that of the original causal network.
Our contributions in this work are hence threefold: (1) we perform a thorough comparison of the various existing transfer learning methods under controlled conditions, (2) we implement the weight sharing penalty in Mustafa and AlRegib (2020) over selective parts of the networks to remove the restriction of same sampling and resolution factors within and between datasets, and (3) we change the 2-D TCN to perform non-causal convolutions for closer fidelity to the underlying physics of the inversion process.

TRANSFER LEARNING METHODOLOGIES
Given a set of N_s labeled training examples D_s = {(x_i^s, y_i^s)}_{i=1}^{N_s} from a source dataset and a usually much smaller set of N_t labeled training examples D_t = {(x_i^t, y_i^t)}_{i=1}^{N_t} from a target dataset, the transfer learning setup aims to use D_s to capture the underlying true distribution P(y^t | x^t) better than would have been possible with D_t alone. In the context of deep learning in particular, a popular means to achieve this is to use the pretrained weights of a network learned on D_s as a starting point for the training process on D_t. This strategy is popularly referred to as "pretraining and finetuning". Another common technique by which the network's generalization capacity may be improved is data augmentation, which involves adding to the source training dataset labeled examples from the target dataset. Using the notation introduced above, such a transfer learning setup entails constructing a new dataset D_c = D_s ∪ D_t containing the labeled training examples from both individual datasets: the dataset for "combined learning", in other words. Yet another approach to inject the source distribution's knowledge into the network training on the target dataset (and vice versa) is the weight sharing strategy introduced in Mustafa and AlRegib (2020), whereby each dataset is trained with its own copy of the 2-D TCN architecture introduced in Mustafa et al. (2020). The knowledge transfer happens via a soft constraint added to the objective function that regularizes the two networks to find optimal solutions within proximity of each other while also optimizing on their respective datasets. The strength of the weight sharing penalty is a hyperparameter that may be tuned to a suitable value. In this work, we implement all three of the knowledge transfer strategies just described, in addition to a control study with no transfer learning whatsoever.
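The soft weight sharing constraint can be sketched in a few lines of PyTorch. The following is a minimal illustration of the idea, not the released implementation; the function name and the loss terms in the comment are hypothetical, and alpha stands in for the penalty strength hyperparameter:

```python
import torch
import torch.nn as nn

def weight_sharing_penalty(net_s: nn.Module, net_t: nn.Module) -> torch.Tensor:
    """Sum of squared differences between corresponding parameters of the
    source and target network copies (the soft weight sharing constraint)."""
    penalty = torch.zeros(())
    for p_s, p_t in zip(net_s.parameters(), net_t.parameters()):
        penalty = penalty + torch.sum((p_s - p_t) ** 2)
    return penalty

# One joint training step would then minimize, with alpha the penalty strength:
#   loss = loss_s(net_s(x_s), y_s) + loss_t(net_t(x_t), y_t) \
#          + alpha * weight_sharing_penalty(net_s, net_t)
```

A small alpha lets each copy stay close to the optimum for its own dataset while still borrowing from the other; a very large alpha effectively forces the two copies into a single shared solution.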

CAUSAL VERSUS NON-CAUSAL TCN
The original temporal convolutional network for impedance model inversion, introduced in Mustafa et al. (2019) and extended in Mustafa et al. (2020), models seismic and impedance trace data as sequences, resulting in an efficiently learned mapping from seismic to impedance logs. Given an input sequence {x_i}_{i=1}^N and a corresponding output sequence {y_i}_{i=1}^N, the conventional sequence modeling paradigm models the relationship f as

y_i = f(x_1, ..., x_i).

Being a causal mapping, this is different from an inversion setup, where any given point y_i would depend on input points both before and after it: a non-causal functional relationship, in other words. This is depicted as

y_i = f(x_{i-K}, ..., x_i, ..., x_{i+K}),

where K is ideally large enough to encompass the whole of the input sequence. The formula relating the output length of a convolutional layer to the input length is

l_out = l_in + 2p - d × (k - 1),

where p refers to the padding added to both sides of the input, d refers to the dilation factor, k is the kernel size, and l_out and l_in refer to the output and input lengths, respectively. The original TCN achieves causality by substituting p = d × (k - 1) into this equation to obtain l_out = l_in + d × (k - 1) and thereafter removing the excess d × (k - 1) elements from the end of the output sequence to set l_out = l_in. To convert this to a non-causal mapping instead, p is set to d × (k - 1)/2 (for an odd kernel size), so that the padding is symmetric about each sample and l_out = l_in without any trimming.
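These two padding schemes are easy to verify with a one-dimensional convolution. The sketch below is a toy PyTorch illustration (a single-channel `Conv1d`, not the paper's 2-D TCN) that applies both schemes and checks that each preserves the sequence length:

```python
import torch
import torch.nn as nn

l_in, k, d = 16, 3, 2  # sequence length, kernel size, dilation
x = torch.randn(1, 1, l_in)

# Causal: pad both sides by p = d*(k-1), then chop d*(k-1) samples off the
# end, so output sample i depends only on input samples at positions <= i.
p_causal = d * (k - 1)
conv_causal = nn.Conv1d(1, 1, kernel_size=k, dilation=d, padding=p_causal)
y_causal = conv_causal(x)[..., :-p_causal]  # l_out = l_in + d*(k-1), trimmed to l_in

# Non-causal: symmetric padding p = d*(k-1)/2 (k odd) gives l_out = l_in
# directly, with each output sample seeing inputs before and after it.
p_noncausal = d * (k - 1) // 2
conv_noncausal = nn.Conv1d(1, 1, kernel_size=k, dilation=d, padding=p_noncausal)
y_noncausal = conv_noncausal(x)

assert y_causal.shape[-1] == l_in and y_noncausal.shape[-1] == l_in
```

The same substitution carries over to the temporal axis of the 2-D convolutions used in this work.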

TRAINING DETAILS
We use the Marmousi 2 and SEAM synthetic datasets to carry out our comparative study of different transfer learning methodologies. Both datasets are publicly available as 2-D impedance models characterizing different subsurface geological settings. Also associated with each impedance model is the corresponding seismic section. Apart from being characteristic of different subsurface geometries, the two datasets also manifest differently in terms of the nature and quality of the seismic data itself; Marmousi 2 is a noise-free product of convolutional forward modeling of the impedance while SEAM is a result of migration with noise levels ranging from medium to severe.
We sample 50 uniformly spaced impedance pseudologs and corresponding seismic data, as described in Mustafa et al. (2020), from the Marmousi 2 section to form D_s. Likewise, we sample 5 uniformly spaced impedance logs from SEAM to form D_t. The network architecture is described in Table 1. Each training session is run for 900 epochs using ADAM as the optimizer and a learning rate of 0.001. For the pretraining- and finetuning-based scheme, pretraining is carried out for 100 epochs before finetuning the network on SEAM with the aforementioned settings. It should be kept in mind that since the relative vertical resolutions between the seismic and impedance pseudologs are different for the two datasets (four for Marmousi 2 and two for SEAM), we cannot apply the weight sharing penalty to the complete network architecture, as in Mustafa and AlRegib (2020). Instead, only the earlier layers in the two networks are allowed to share weights, with the later layers involved in the upsampling process training independently of each other.

Table 2 reports, for both SEAM and Marmousi 2, the mean squared error (MSE), mean absolute error (MAE), median absolute error (MedAE), and the r^2 coefficient of determination between the estimated acoustic impedance produced by each method and the ground truth acoustic impedance for the dataset. It can be observed that the weight sharing-based knowledge transfer scheme results in the best average performance on the SEAM dataset (the target) in all four metrics. Notice also that the case where the two datasets are trained on independently (top row) slightly outperforms the transfer learning strategies of pretraining-finetuning and combined learning (second and third rows for SEAM, respectively). This can be explained by the fact that the transfer learning performance of these techniques is highly dependent on several factors: the number of pretraining epochs and the pretraining learning rate, the number of pretrained layers and source training examples, the finetuning learning rate, the number of target examples, etc.
Optimizing these hyperparameter settings requires several trials and/or subject matter expertise, in the absence of which the performance may either not improve over the baseline case or may even degrade in the worst case. This is especially true when the source and target training examples are sampled from radically different distributions. This is not to mention the degradation on the original source dataset that occurs as a result of adjusting the network to the target distribution (thereby rendering it no longer optimal for the source dataset), as also seen in row two of the table for Marmousi 2. All of this is avoided by using the weight sharing scheme with a soft penalty, since it allows the different network copies to optimize on their respective datasets while still drawing on beneficial knowledge in the learned weights of the other network (row 4). However, forcing the two networks to reach the exact same solution by placing a large penalty on the weight sharing loss results in severe degradation on both datasets, as evidenced in the last row of Table 2. This happens because there might not exist a single global minimum for two datasets sampled from very different distributions.

It is also instructive to examine the learned weights of the two networks. It is commonly understood amongst deep learning practitioners that deep networks learn general, domain-invariant representations in the earlier layers, followed by more task- and data-specific ones later. It is therefore expected that the difference in learned weights of the two networks be smaller earlier on and larger later. Plotting the sum of squared differences between corresponding learned weights, this is exactly the trend that we observe, as shown in Figure 1. Figure 2 shows the ground-truth acoustic impedance profile for SEAM along with the profiles predicted by the causal and non-causal TCNs, respectively.
It may be seen (especially in the areas pointed out by the arrows) how the non-causal TCN is able to capture the impedance better than its causal counterpart, simply on account of its higher fidelity to the underlying physics. The blind well test at x = 2794 m also confirms this hypothesis, with the non-causal TCN capturing the lower part of the ground-truth impedance better than the causal TCN.

Figure 1: Sum of squared errors between corresponding layer weights of the two networks.
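The per-layer quantity plotted in Figure 1 can be computed as in the following sketch, where two toy networks stand in for the jointly trained source and target TCN copies; the function name is hypothetical:

```python
import torch
import torch.nn as nn

def layerwise_sse(net_s: nn.Module, net_t: nn.Module) -> dict:
    """Sum of squared errors between corresponding weights, one value per
    named parameter, in layer order (the quantity plotted in Figure 1)."""
    params_t = dict(net_t.named_parameters())
    return {name: float(torch.sum((p_s - params_t[name]) ** 2))
            for name, p_s in net_s.named_parameters()}
```

For two jointly trained copies, the expectation described above is that the values for the earlier layers come out smaller than those for the later, more dataset-specific layers.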

CONCLUSION
While transfer learning can help reduce the risk of overfitting in the presence of limited labeled training data in a target dataset by using labeled training examples from the source, it can also degrade performance on the former if the two datasets happen to be sampled from very different distributions or if the transfer learning hyperparameters are not optimized for the specific use case. In this work, we demonstrate through a comparative study that the knowledge sharing scheme performed through soft weight sharing is superior to conventional transfer learning strategies on both counts. We also show that it is possible to implement the soft weight sharing strategy in the case where the sampling factors among and within datasets are very different through selective application of the weight sharing penalty. Lastly, we also demonstrate that non-causal TCNs outperform causal ones by virtue of their stronger fidelity to the physics of the inversion process. This study opens promising future directions exploring how the joint learning process can benefit in the case of three or more datasets by allowing a transfer of richer representations than possible with only two datasets. Also of great interest to geophysicists would be the extension of 2-D temporal convolutional networks to 3-D spatiotemporal models to better capture the lateral heterogeneity in 3-D seismic surveys. By showing how best practices in machine learning and geophysics can combine to produce superior results, we hope our work can be a significant step towards that end.

Table 2: Regression accuracy metrics for the various methods discussed in the paper. We compute and display, for both SEAM and Marmousi 2, the mean squared error (MSE), mean absolute error (MAE), median absolute error (MedAE), and the r^2 coefficient of determination between the estimated acoustic impedance produced by the method and the ground truth acoustic impedance for the dataset.
α refers to the weight assigned to the weight sharing penalty, as explained in Mustafa and AlRegib (2020).