Geological Facies modeling based on progressive growing of generative adversarial networks (GANs)

Geological facies modeling has long been studied to predict subsurface resources. In recent years, generative adversarial networks (GANs) have been used as a new method for geological facies modeling with surprisingly good results. However, in conventional GANs, all layers are trained concurrently, and the scales of the geological features are not considered. In this study, we propose to train GANs for facies modeling based on a new training process, namely progressive growing of GANs or a progressive training process. In the progressive training process, GANs are trained layer by layer, and geological features are learned from coarse scales to fine scales. We also train a GAN in the conventional training process, and compare the conventionally trained generator with the progressively trained generator based on visual inspection, multi-scale sliced Wasserstein distance (MS-SWD), multi-dimensional scaling (MDS) plot visualization, facies proportion, variogram, and channel sinuosity, width, and length metrics. The MS-SWD reveals realism and diversity of the generated facies models, and is combined with MDS to visualize the relationship between the distributions of the generated and training facies models. The conventionally and progressively trained generators both have very good performances on all metrics. The progressively trained generator behaves especially better than the conventionally trained generator on the MS-SWD, MDS plots, and the necessary training time. The training time for the progressively trained generator can be as small as 39% of that for the conventionally trained generator. This study demonstrates the superiority of the progressive training process over the conventional one in geological facies modeling, and provides a better option for future GAN-related researches.


Introduction
Subsurface geological facies modeling is a very important part of the workflow for accurate assessment of subsurface resources such as groundwater, petroleum, and carbon storage. Geological facies modeling is a process of integrating various observed data (e.g., well data, seismic data, and outcrops) and geological patterns, and predicting the range of spatial geological facies distributions in the subsurface. Many geostatistical methods have been used for facies modeling, including variogram-based methods, multiple points statistics (MPS)-based methods, object-based (Boolean) methods, and process-mimicking methods (see e.g. [1,2]). These traditional methods have various advantages and disadvantages. Some of them are still under research, such as the development of treebased direct sampling MPS method [3] and many variations of forward stratigraphic modeling (e.g. [4]).

Key Points
• Progressive growing of GANs is used for geological facies modeling, where geological features are learned from large scales to fine scales.
• Multi-scale sliced Wasserstein distance, multi-dimensional scaling plot, facies proportion, variogram, and channel sinuosity, width, and length are proposed as metrics to evaluate GANs.
• Progressive growing of GANs behaves better than conventional training processes in many metrics including computation time. With the quick improvement of algorithms and enlargement of datasets, deep learning (DL) has been providing robust solutions for many complicated problems that cannot be addressed using conventional methods. These problems can be within any field including geosciences. Tahmasebiet et al. [5] reviewed commonly used machine learning (including deep learning) algorithms and their state-of-the-art applications in various problems of geosciences. These problems include prediction of rock parameters, reconstruction of porous media, modeling of ground water level, soil moisture prediction, carbon leakage management. Widely used DL techniques vary from convolutional neural networks, recursive neural networks, variational autoencoder, to generative adversarial networks, etc. DL techniques can also be combined with conventional statistical methods for better performances, such as the combination of recursive convolutional neural networks with MPS method [6].
Generative models have also been used in the geosciences, especially in the geological facies modeling problem. Laloy et al. [24] used VAE for low-dimensional representation of binary geological models. Mosser et al. [25] and Mosser et al. [26] used DCGAN for generating three-dimensional solidvoid structure of porous media and micro-CT-scale oolitic Ketton limestone. Chan and Elsheikh [27] used Wasserstein DCGAN to generate geological facies models.
In the above GAN applications, all layers of GANs were trained concurrently. Karras et al. [17] proposed an alternative to the conventional training process -the progressive growing of GANs or the progressive training process, i.e. training GANs layer by layer, and proved the superiority of this progressive training process over the conventional training process in multiple applications. The focus of this paper is to investigate the progressive GAN training process for geologic facies modeling, and compare the progressive training process with the conventional training process, based on a set of evaluation metrics, including realistic reproduction of facies distributions and computation time. In addition, we also build two large training datasets for facies modeling. These datasets can be used for many facies modeling-related future works. This paper is organized as follows. In section 2, we introduce related background materials on GANs. Section 3 presents the workflows and settings on how to train GANs for facies modeling with both the conventional and the progressive training processes. In section 4, we build a set of evaluation metrics for the trained GANs. Section 5 presents the facies model training datasets. Then, in section 6 we compare and analyze the results of the trained conventional GANs and progressive GANs. Finally, conclusions are provided in section 7.

GAN framework
The framework of GANs was proposed by Goodfellow, Pouget-Abadie [12] to address the generative problem. Given many observed real samples x r 's from an unknown distribution p data over a high-dimensional space X , i.e., x r~pdata , the goal of GANs is to train a generative model that can reproduce a distribution p G to approximate p data . Figure 1 shows the typical workflow of a GAN. In general, a GAN includes two trainable blocks, a generator (G θ ) and a discriminator (D φ ) (θ and φ are trainable parameters). We define a latent variable Z with a known distribution p z (e.g., Gaussian) over a low-dimensional space Z, and define z as a sample from Z, i.e., z~p z . The generator G θ maps z into x G over the space X , i.e., x G = G θ (z); the distribution of x G is p G , i.e., x G~pG . The discriminator D φ maps the generated sample x G and the given data sample x r into two scalar values, which are called scores; the two scores are abbreviated as s G and s r , i.e., s G = D φ (x G ), s r = D φ (x r ). The scores evaluate the realism of the inputs of D φ . The loss function of GANs is based on some type of distance between s G and s r (Eq. (1)); in essence, the loss represents the distance between the p G and p data . We will discuss the loss function in more detail in section 2.2. The discriminator D φ and the generator G θ are alternatively trained by maximizing and minimizing the loss (Eq. (1)), until a certain stopping criterion is reached. In practice, usually a batch of z's and x r 's are taken as inputs for the training of GANs.
In the above equation, L(G θ , D φ ) is the loss function, Dist(s G , s r ) is some type of distance between s G and s r , and E is the expectation over x G~pG and x r~pdata . Figure 2 gives an intuitive way to understand the mechanism behind GANs. For better understanding, here we specify the loss function in Eq.
, which is the L1 distance between the expectation of s G 's and the expectation of s r 's. The loss function is affected by both G θ (z) (the yellow mapping from z to x G in Fig. 2) and D φ (the dark blue curve in Fig. 2). Whenever D φ is trained, the dark blue curve is adjusted to be larger on the left and smaller on the right, to increase the loss function, discriminating better between s G 's and s r 's, e.g., from (a) to (b) or from (c) to (d) in Fig. 2. On the contrary, when G θ (z) is trained, x G 's are shifted a step closer towards x r 's, to decrease the loss, Fig. 1 The basic GAN framework in the context of the geological facies modeling. The generator G θ maps a latent vector (z) into a "fake" generated facies model (x G ). The discriminator D φ maps x G and a "real" facies model (x r ) into two scores, one real score and one fake score. The scores evaluate the realism of the inputs of D φ . Finally, the loss is calculated based on Eq. (1); D φ and G θ are alternatively trained by maximizing and minimizing the loss Fig. 2 A schematic illustration of the training of GANs. Training the discriminator (D φ ) adjusts the dark blue curve to be larger on the left and smaller on the right, to increase the loss function, to better discriminate the generated samples (x G 's, blue) from the real samples (x r 's, red), e.g., from (a) to (b) or from (c) to (d). On the contrary, training the generator (G θ (z)) maps x G 's a step closer towards x r 's, to decrease the loss function, e.g., from (b) to (c) or from (d) to (e). Every successive pair of the training of D φ and G θ actually pushes x G 's a step closer to x r 's, e.g., from (a) to (c) or from (c) to (e); finally x G 's and x r 's completely mix in (f), representing p G = p data e.g., from (b) to (c) or from (d) to (e) in Fig. 2. Every successive pair of the training of D φ and G θ pushes x G 's a step closer to x r 's, e.g., from (a) to (c) or from (c) to (e) in Fig. 2; finally, x G 's and x r 's completely mix, representing p G = p data , e.g., (f) in Fig. 2.
Basically, the generator and the discriminator are functions with trainable parameters, and these functions can be of any form. When GANs were first proposed, the generator and the discriminator were multilayer perceptrons [12]. In recent years, depending on the practical needs, different types of architectures have been proposed for the generator and the discriminator. Generally, if the outputs of the generator are spatially correlated data (e.g., images), the architectures of generator and discriminator are designed to be convolutional neural networks (CNN) (e.g., [13,28,29]); if the outputs of the generator are time-related sequence data (e.g., natural language), the architectures of the generator and discriminator may be designed to be recurrent neural networks (RNN) (e.g., [30]). The architecture design of the generator and the discriminator can be very complicated in complicated applications (e.g., [31,32]).

GAN loss
Since the GAN framework was proposed, many different types of loss functions have been studied [33], and each of them corresponds to a type of distance between p G and p data .
Originally, Goodfellow, Pouget et al. [12] proposed the loss function given below in Eq. (2), where the last layer of D φ is a sigmoid function.
This loss function was proved to be equivalent to having a Jensen-Shannon divergence between p data and p G [12].
Arjovsky et al. [14] showed that the Wasserstein distance is more sensitive than the Jensen-Shannon divergence, and proposed the Wasserstein loss function based on the Wasserstein distance between p data and p G shown in Eq. (3) where D φ does not have the sigmoid function in the last layer and should be changing slowly. Gulrajani et al. [15] improved the Wasserstein loss function by using a gradient penalty to enforce the gradient of D φ to be small. This improved loss function is shown in Eq. (4).
Here, D φ does not have the sigmoid function in the last layer; λ is a predefined weight, and b x is sampled between x r~pdata and x G~pG using t~uniform(0, 1), i.e., b x ¼ tx þ 1−t ð Þx G .

Progressive growing of GANs
The training of GANs is actually a process of forcing the generator to learn all the features of p data ; the more features are learned, the closer p G and p data would be ((a) to (c) in Fig.  2); if all features are learned, then p G = p data ((f) in Fig. 2).
Features of a distribution have various scales; for example, in terms of the distribution of human face images, the scale of face gesture feature is larger than the scale of mouth size feature. In most GAN related researches, all layers of GANs are trained concurrently and the scales of features are not considered, so the generator has to manage and learn different scales of features completely by itself; this may result in an inefficient way of feature learning of the generator, e.g., some fine-scale features may be learned earlier than the large-scale features. Therefore, Karras et al. [17] proposed a new training methodology for GANs: a progressive growing of GANs or a progressive training process, in which the generator and the discriminator are trained layer by layer. The workflow for progressive growing is as follows. First, the original highresolution training data x r are downsampled into lowerresolution training data at different resolution levels, e.g., downsampling 1024×1024 training images into 512×512, 256×256,…,4×4 training images. These different levels of low-resolution training data represent different scales of features of the original-resolution training data. The lowerresolution downsampled training data capture only the coarse scale features represented in the training data. Second, the shallowest layers of the generator and the discriminator are activated, and the GAN is trained with the lowest-resolution training data. After training, the largest-scale features represented by the lowest-resolution training data are learned by these activated shallowest layers of the generator. Third, the next shallower layer of the generator and the discriminator are further included, and the GAN is trained with the second lowest-resolution training data. After training, the features at the next level of resolution represented by the input training data are learned by the newly activated shallower layers of the generator. Then, increasingly more layers of the generator and the discriminator are progressively included, and largerresolution training data are used to train the GAN. In this way, the shallower layers of the generator learn the largerscale features of the training data, while the deeper layers of the generator learn the finer-scale features of the training data. Throughout the training process, all activated layers of both the generator and the discriminator remain trainable. New layers of the generator and the discriminator are included smoothly, meaning their contribution increases from 0 to 1 to avoid disrupting the already well-trained, shallower layers. The progressive growing of GANs provides a process of learning different scales of features from large scales to fine scales. This philosophy is also well revealed in the multiplegrid geological facies modeling approach, which was proposed by Tran [34] and has been widely used especially in MPS modeling methods (e.g., [35][36][37]). In the multiple-grid modeling approach, the coarse grid is first simulated to capture large-scale structures from the training image with a coarse template, then the fine grid is simulated, conditioned to the coarse grid simulation, to capture fine-scale structures with a fine-scale template.

Geological facies modeling
We train GANs for the geological facies modeling problem, with both the conventional training process and the progressive training process, and compare their results. In this section, first, we introduce the GAN architectures we use in this study; then we describe the settings for the GAN training; finally, we specify the progressive training process, based on the specified architectures of the generator and the discriminator. The conventional training process is standard (e.g., [14]), so we do not discuss it in detail.

GAN architectures used in this study
The facies model is spatially correlated, so we use CNN for both the generator and the discriminator. The architecture of the generator is shown in Fig. 3 and Table 1. The input of the generator is a 128×1 latent vector that is sampled from a standard Gaussian distribution, i.e., z~Gaussian(0, 1); the output is a 64×64 facies model. The architecture of the generator includes 1 fully connected neural network layer with 128 input neurons and 2048 output neurons, 1 reshape layer, 4 upsampling layers, and 10 2-dimensional convolutional layers. The reshape layer converts the 2048×1 vector into the 4×4×128 feature cube, corresponding to 128 feature maps of size 4×4. Each upsampling layer dilates the heights and widths of the feature cubes by 2, using the nearest-neighbor upsampling method. The kernel size for the last convolution layer of the generator is 1×1, while the kernel size for the other convolutional layers is 3×3; the stride size for all convolutional layers is 1×1. The leaky rectified linear unit Fig. 3 The architectures of the generator and the discriminator used in this study. The input of the generator is a 128-dimensional vector sampled from the Gaussian distribution, i.e., z~Gaussian(0, 1); the input propagates through the fully connected layer, the reshape layer, multiple convolutional layers, and multiple upsampling layers; finally the generator outputs a 64×64 2-dimensional facies model. The discriminator takes the 64×64 facies model as input and finally outputs a score to evaluate the realism of the input Fig. 4 Illustration of how a mini-batch standard deviation layer works. The batch of feature cubes on the left are calculated from the earlier layers of the discriminator. Each feature cube includes multiple vertical square feature maps. The mini-batch standard deviation layer first calculates the standard deviation for each feature map at each spatial location over the mini-batch to obtain a standard deviation feature cube. Then the cube of standard deviations are averaged to obtain a single overall average value. This value is replicated into a feature map of the same resolution as other feature maps, and is concatenated to other feature maps of the mini-batch function (LReLU) with a leaky value of 0.2 is used as the activation function in all hidden layers except the last for which only the linear function to add bias is used. As seen in Fig. 3 and Table 1, the architecture consists of five blocks of layers producing feature maps with coarser to finer resolution -4×4, 8×8, 16×16, and 32×32, and finally 64×64. The first fully connected layer is also included in block 1 (4×4) for easier description. These five blocks are trained progressively starting from the coarse 4×4 block to finer and finer blocks ending with 64×64 output image. The progressive training process is described in more detail later.
The architecture of the discriminator is essentially symmetrical to the generator, with corresponding blocks of layers producing feature maps with coarse (4×4) to increasingly fine resolution (8×8, 16×16, 32×32, and 64×64), except that the output of the discriminator is a scalar and a mini-batch standard deviation layer (discussed below) is applied at the discriminator ( Fig. 3 and Table 1). The kernel size for the first convolutional layer of the discriminator is 1×1, while the kernel size for the other convolutional layers is 3×3. The minibatch standard deviation technique was proposed by Karras et al. [17] to increase the variability of the generated results. The calculation steps for the mini-batch standard deviation layer are as follows (Fig. 4): first, calculate the standard deviation for each feature map at each spatial location over the mini-batch; second, average these calculated standard deviation values over all feature maps and spatial locations to obtain a single value; third, replicate the value into an additional feature map with the same resolution as other feature maps; finally, concatenate the new feature map with other feature maps of the mini-batch, increasing the channels (or the number of feature maps) by one, from 128 to 129 after concatenating the mini-batch standard deviation layer (MSD in Fig. 3). This layer could be inserted anywhere in the discriminator, but it is best to be inserted towards the end [17].

Loss function training
We use the Wasserstein loss function with gradient penalty (W-gp). During the training, the trainable parameters (i.e., θ and φ) are initialized with He initialization approach [38]. To speed up the training process, mini-batch gradient descent and the Adam optimizer [39] with the default parameters (i.e., α = 0.001, β 1 = 0.9, β 2 = 0.999, and ε = 10 −8 ) are used. Every mini-batch is set to include 32 facies models. In many studies, the number of optimization steps for the discriminator in each loop is set to be larger than 1, i.e., the training alternates between multiple iterations of optimizing the discriminator and one step of optimizing the generator. In our study, we set the optimization steps of the discriminator to be 1 in each loop, because the Wasserstein loss can largely stabilize the training process. In this work, we call each pair of the optimizations of the discriminator and the following generator as one step of alternative training. We use Tensorflow (tensorflow. org), an open-source deep learning framework, to construct and train our GANs. 2 GPUs (NVIDIA Tesla V100-PCIE-32GB), 10 CPUs, and 80G RAM are used in parallel for the training.
The criterion for stopping the training is important during the training of GANs. Different from other deep learning approaches, the loss in GAN, by definition, does not measure the accuracy, although observations of some experiments show The progressive GAN training workflow used in this study. In phase 1, the layers of block 1 in the generator and the discriminator in Fig. 3, are trained from scratch with the 4×4-size training facies models. In phase 2, the layers of block 2 in the generator and the discriminator are smoothly included. The trainable parameters of the layers marked with blue triangle are initialized from scratch, and the remaining trainable parameters are initialized from previous phase. All activated layers in phase 2 are trained with the 8×8-size training facies models. Similar to phase 2, more and more blocks of layers in the generator and the discriminator are trained in successive phases until all layers are trained that the Wasserstein loss reflects the quality of the generated images [15]. We use two evaluation metrics as the stopping criteria, i.e., the multi-scale sliced Wasserstein distance and visual inspection. The multi-scale sliced Wasserstein distance measures the distance between the distribution of the generated results and the distribution of the training data, over multiple scales; visual inspection refers to visually inspecting the generated results during the training process by humans, until the generated results are visually indistinguishable from the training data. The training stops when the multi-scale sliced Wasserstein distance converges and the generated results are visually indistinguishable from the training data. We will discuss the multi-scale sliced Wasserstein distance and the visual inspection in detail in Section 4. We evaluate the generator every 20 k steps of alternative training based on the above criteria to decide when to stop the training process. Figure 5 shows the progressive training workflow of this study. Different from the conventional training workflow, we downsample the original 64×64-size training facies models into 32×32, 16×16, 8×8, and 4×4-size training facies models, using averaging. These downsampled facies models and the original ones are used during the following process.

Progressive training workflow
The layers of the generator and the discriminator are progressively trained block by block from coarse to fine (shallow to deep), in different phases. In phase 1, we activate the layers of block 1 (4×4) in the generator and the discriminator (Fig. 5), and add two additional convolutional layers (i.e., "CV(1×1)" in Fig. 5). The two convolutional layers convert the output (the 4×4×128 feature cube) of block 1 in the generator into a 4×4 facies model and convert a 4×4 facies model, either a training or a generated facies model, into a 4×4×128 feature cube as the input of block 1 in the discriminator. The involved layers are initialized from scratch. We train the activated layers in this phase with 4×4-size training facies models.
In phase 2, we now activate the layers of block 2 (8×8) in the generator and the discriminator, and add two new convolutional layers ("CV(1×1)" in Fig. 5) that convert the output (8×8×128 feature cube) of block 2 in the generator into a 8×8 facies model and correspondingly convert a 8×8 facies model into a 8×8×128 feature cube. To avoid disrupting the already trained layers in the previous phase, the newly activated layers (block 2) are included smoothly in the generator and the discriminator by a weighted average, meaning that the contribution of the new layers increases gradually from 0 to 1. We upsample the 4×4 facies model generated from block 1 in the generator into a 8×8 facies model, and average that 8×8 facies model from block 1 with the 8×8 facies model converted from the output of block 2, with a weighting factor α. The generator output now is thus (1 − α)× (upsampled 8×8 facies model from block 1) +α×(8×8 facies model from block 2). In the discriminator, the input 8×8 facies model, either a training or a generated facies model, is both directly converted into a 8×8×128 feature cube and downsampled into a 4×4 facies model. Block 2 in the discriminator takes that 8×8×128 feature cube as input and generates a 4×4×128 feature cube. At the same time, the downsampled 4×4 facies model is converted into another 4×4×128 feature cube by a convolutional layer. These two 4×4×128 feature cubes are then averaged into a new 4×4×128 feature cube with the same weight α, i.e., (1 − α)× (4×4×128 feature cube converted from 4×4 facies Fig. 6 We apply MS-SWD to assess the distance between the distribution of the generated facies models and the distribution of the training facies models. We calculate the Laplacian pyramid representations for 4000 generated facies models and 4000 randomly selected training models. Next, we randomly sample 32 5×5-pixel patches from the Laplacian pyramid representation of each facies model at each level. We normalize these patches with the mean and the standard deviation of each patch. Finally, we calculate the SWD between these patches from the generated facies models and from the training facies models, at each level model)+α×(4×4×128 feature cube generated from block 2). This new 4×4×128 feature cube finally goes through the layers of block 1 in the discriminator to output the score. The weighting factor α represents the contribution of the newly activated layers of block 2. It first changes linearly from 0 to 1, to ensure that the newly activated layers fade in smoothly without breaking the already learned features stored in the previous block. After that, α is held constant at 1. This allows training of all parameters (in the newly activated block as well as the previous block) to the same degree, to enhance the  In a similar manner as in phase 2, we progressively train more and more blocks of layers in phase 3 (16×16), phase 4 (32×32), and phase 5 (64×64) until all layers in the generator and the discriminator are trained.

Evaluations of GANs
We evaluate the generator by assessing its generated facies models, based on the following metrics described below: visual inspection, multi-scale sliced Wasserstein distance, multidimensional scaling plots, facies proportion, variogram, and channel sinuosity, width, and length.

Visual inspection
Visual inspection of the generated samples is one of the most common and intuitive ways to evaluate GANs [40]. We generate large numbers of facies models, and assess their quality (i.e., realism and diversity) by comparing them with the training facies models. Visual inspection is used both in the training process as a stopping criterion and after the training process to inspect the performance of the trained generator.
Although visual inspection is the simplest approach for GAN evaluation, it is expensive, biased, requires background knowledge, and cannot fully reflect the capacity of the generator [40]. We further use several other quantitative evaluation metrics.

Multi-scale sliced Wasserstein distance (MS-SWD) and multi-dimensional scaling
Many researchers use the multi-scale structural similarity index (MS-SSIM) [41] or directly the structural similarity index (SSIM) to evaluate the generator's output (e.g., [42,43]). Karras et al. [17] found that MS-SSIM can reveal large-scale mode collapse reliably but fails to react to smaller effects such as loss of variation in colors or textures, and MS-SSIM does not assess the realism of the generated samples. Therefore, Karras et al. [17] proposed multi-scale sliced Wasserstein distance (MS-SWD) to evaluate the generator, based on the intuition that a successful generator should produce samples whose structure is similar to the structure of the training data over all scales. We apply MS-SWD for our study (Fig. 6) as follows. First, we randomly generate 4000 facies models from the trained generator and randomly select 4000 training facies models. Second, we obtain the Laplacian pyramid representations [44] of both the generated facies models and the training facies models from resolution of 64×64 to 16×16. The Laplacian pyramid representations reveal the structures of the original facies models at different scales. Third, we randomly extract 32 5×5-pixel patches from the Laplacian pyramid representation of each facies model at each level, to obtain  As the training progresses, the generated facies model distribution gets closer to the training facies model distribution, but after 420 k training iterations there is minor or no improvements of the relationship between these two distributions Fig. 11 Transition of the generated facies models with the same input latent vectors after different iterations of alternative training, during the conventional training process 128,000 patches respectively from the generated facies models and the training facies models at each level. Fourth, we normalize these patches with respect to the mean and the standard deviation of each patch. Finally, we calculate the sliced Wasserstein distance (SWD), an efficient approximation to the Wasserstein distance [45], between the patches from the generated facies models and the patches from the training facies models at each level. The smaller the MS-SWD, the closer is the distribution of the generated facies models to the distribution of the training facies models, and better is the performance of the generator with respect to both realism and variation of the generated facies models. We average MS-SWD over different levels to obtain one value to assess the distance between two distributions; this value is defined as the average SWD in this study. To evaluate the generator more objectively, we also randomly select two groups of the training facies models (4000 in each group) and calculate their MS-SWD, and their average SWD is regarded as the ground truth baseline. In the progressive training, the resolution of the generated facies models varies from 4×4 to 64×64; for the low-resolution (<64×64) facies models, we first upsample them into 64×64 with the nearest-neighbor method, and then calculate the MS-SWD.
Although the calculated MS-SWD can be indicative of the closeness between the distributions of the generated and the training facies models, we may still have no clear sense of Fig. 12 The W-gp loss versus iterations of alternative training, during the conventional training process Fig. 13 The change of MS-SWD of the generator during the progressive training process. The MS-SWD converges after 160 k iterations of alternative training. The dashed black lines represent the points when the deeper blocks of 16×16, 32×32, and 64×64resolution were added to the generator their spatial relationship. Therefore, we propose a method to visualize the distribution of the generated and the training facies models in 2D space, based on MS-SWD and multidimensional scaling (MDS) approach [46]. MDS was also used to evaluate the GAN-generated facies models in [47] but based on the modified Hausdorff distance. We randomly generate 12,000 facies models using the generator and randomly select 12,000 training facies models. We randomly divide the generated and the training facies models into 300 sub-groups of size 40 each, respectively. The average SWD between every two of the 600 groups, is taken as the pairwise distance matrix used for multi-dimensional scaling. We use MDS to approximate the distribution of the 600 groups of the facies models in 2D space. This distribution can visually show the relationship between the generated facies models and the training facies models. In addition, in the same way, we also visualize the distribution of the training facies models, the distribution of the facies models generated from the generator trained in the conventional process, and the distribution of the facies models generated from the generator trained in the progressive process. This MDS plots visualization approach can help to compare the performances of GANs trained in different processes.

Facies proportion
For each facies model, we calculate a proportion of each facies type. The facies proportion distribution of the generated facies models should be close to that of the training facies models, for each facies type. To quantitatively measure this closeness, we define an error as the area between the facies proportion cdf of the generated facies models and that of the training facies models. This error is between 0 and 1. In this study, we sample 3000 generated facies models and 3000 training facies models, to compare their facies proportion distributions over different facies types.

Variogram
Variogram describes two-point spatial continuity of properties, including the geological facies. We sample 3000 generated and 3000 training facies models, calculate their

Channel sinuosity, channel width, and channel length
We use image processing algorithms to calculate channel sinuosity, channel width, and channel length (arc length) for each facies model. The channel sinuosity is defined as the ratio of channel length to the straight-line length between two end points. We used geodesic distance and Euclidean distance transforms algorithms to calculate the width, arc length, and straight-line length of channels in facies models. We randomly sample 200 generated facies models and 200 training facies models to compare their distributions of channel sinuosity, channel width, and channel length. The channel length reveals the connectivity of channels, to some extent.

Training datasets
In terms of the facies model training dataset used for GANs, researchers have used training images (TIs) [48], as well as object-based facies modeling methods (e.g., [29,42]) to construct their own training datasets. In addition, Nesvold and Mukerji [49] used satellite images as the training dataset. In our study, we synthesized two systematic facies training datasets in terms of different reservoir types: sinuous channel reservoirs and underground river-related karst caves.
The training sinuous channel facies models were synthesized in the commercial Petrel platform using mature objectbased modeling method. The facies types include channel complex facies (channel sand and channel bank) and inter-channel mud facies. The resolution of each of the 35,640 2-dimensional facies models is 64×64. Each facies model includes multiple channel complexes, and these channel complexes have similar global features (e.g., orientation, sinuosity, width, amplitude, etc.). During the synthesizing process, we tune the number, orientation, width, wavelength, and amplitude of the channel complexes, to create a variety of synthesized facies models. Figure 7 shows some facies model examples.
For the karst cave training models, we initially designed an object-based simulation method to produce 642 large karst cave realizations with resolution of 655 × 655 cells. These realizations vary in cave width, cave direction, and the number of cave branches (Fig. 8). There are totally two facies types: cave and non-cave facies. We randomly cropped these large realizations along karst caves to form 22,695 small 64×64-resolution karst cave facies models, which will be used as the training facies models in this study. Figure 8 shows the cropping process and some large and small karst cave facies model examples.

GANs used for sinuous channel facies modeling
The sinuous channel facies model training dataset was used for training the GANs in this case. First, we trained the GAN Fig. 15 Transition of the generated facies models with the same input latent vectors at different iterations of alternative training in the progressive process. Facies models are upsampled to 64×64, if they are smaller than 64×64. The first 4 columns are upsampled from 4×4, 8×8, 16×16, and 32×32 into 64×64. The quality of the generated facies models increases stably, especially at the transitions when the resolution is doubled. After 280 k iterations, the generated facies models are almost indistinguishable from the training data Fig. 16 The W-gp loss after different times of the alternative training, in the progressive manner Fig. 17 Comparison among the facies models generated from the conventionally trained generator (upper left), the facies models generated from the progressively trained generator (upper right), and the training facies models (upper center). All the generated facies models are all very similar to the training facies models in both realism and diversity, and only a small number of generated facies models ((a), (b), (c), and (d)) exhibit minor flaws of discontinuous channel complexes in the conventional process. The trend of the average SWD becomes smaller and finally flattens out after about 460 k iterations of alternative training (Fig. 9). Figure 10 shows the MDS plots of the distributions of the generated facies models and the training facies models in 2D space during the training process; after 420 k alternative training iterations, there is minor or no improvement in the relationship between the generated and the training facies models. By further visual inspection, we found that, after 520 k alternative training iterations, the generator can generate very realistic facies models, almost indistinguishable from the training data by human eyes. Figure 11 shows some random generated facies model examples after different iterations of the alternative training. We stopped training after 520 k iterations of alternative training, and kept that final generator for further evaluation and applications. This generator is called the conventionally trained generator hereafter. The total training time is 15.4 h using 2 GPUs (NVIDIA Tesla V100-PCIE-32GB), 10 CPUs, and 80G RAM in parallel. Figure 12 shows the W-gp loss (see Eq. (4)) versus training iterations; this loss is also called the critic loss in GAN research community.
Next, we trained another GAN with the same architecture but now using the progressive process described above. The MS-SWD converges after about 160 k iterations of alternative training (Fig. 13). Figure 14 further shows in the MDS plots that the generated facies model distribution gets closer to the training facies model distribution as the training progresses, and these two distributions almost totally mix after 160 k iterations. Further visual inspection indicates that, after 280 k iterations the generated facies models are almost indistinguishable from the training facies models (Fig. 15). Therefore, we stopped training after 280 k iterations of alternative training, and kept that generator for further assessment and applications. This generator is called the progressively trained generator hereafter. The total training time is 6.03 h using the same computational capabilities as for the Fig. 18 The change of the average SWDs of the conventional and progressive training, during the training process Fig. 19 The comparison of the training facies models, the facies models generated from the conventionally trained generator, and facies models generated from the progressively trained generator, in 2D space, using MDS plot. The facies models generated from the progressively trained generator is closer in distribution to the training facies models than the facies models generated from the conventionally trained generator conventionally trained GAN. Figure 15 also indicates that the quality of the generated facies models increases stability during the training, especially at the time points when the output resolution is doubled. At these time points, more layers are involved in the generator and the discriminator, so the generator and the discriminator become more robust. The generated facies models after these times points (e.g., the fourth column in Fig. 15) are based on the generated facies models before these times points (e.g., the third column in Fig. 15). The newly added deeper layers in the generator learns more detailed features of the facies models based on the features learned by the shallower layers. Figure 16 shows the W-gp loss (see (4)) versus alternative training iterations.
We compared the conventionally trained generator and the progressively trained generator using visual inspection, MS-SWD, MDS plots, facies proportion, the variogram, and the channel sinuosity, width, and length metrics.
The conventionally and progressively trained generators were both applied to generate 20,000 facies models with one GPU (NVIDIA Tesla V100-PCIE-32GB), respectively, and it takes around 3.2 s for each of the two trained generators. We manually compared these generated facies models with the training facies models, with respect to realism and diversity. The facies models generated from both generators are all very realistic and diversified, and almost indistinguishable from the training facies models, in spite of some minor flaws of discontinuous channel complexes in a small number of generated facies models (Fig. 17). It is difficult to decide which generator is superior based on visual inspection alone. Figure 18 shows the change of the average SWDs during the conventional and progressive training against the baseline. The average SWD of the conventional training process flattens out slowly, and is farther away from the baseline. Figure 10 indicates that, in the conventional training process, although the initial generated facies model distribution is not far from the training facies model distribution, the two distributions cannot get very close even after many iterations of alternative training. On the other hand, the average SWD of the progressive training converges very quickly, and the convergence value is very close to the baseline (Fig. 18). The progressive training process starts from low-resolution output and progressively increases the resolution. Hence during the initial 2 h of training, the average SWD of the progressive training is much larger than that of the conventional training process (Fig. 18), and the generated facies model distribution is very far from the training facies model distribution (Fig. 14). During the progressive training, whenever the resolution of the output is doubled, there is a large improvement in the average SWD and in the relationship between the generated and the training facies model distributions.
The final generated facies model distribution of the progressive training almost totally mixes with the training facies model distribution (last plot in Fig. 14).
The average SWD of the progressively trained generator (6.81) is much closer to the baseline (4.69) than the average SWD of the conventionally trained generator (12.74) (Fig. 18). Figure 19 uses MDS plot to compare the distribution of the training facies models, and the distributions of the facies models generated from both the conventionally trained and the progressively trained generators, in 2D space. It is clear that the facies models generated from the progressively trained generator is closer to the training facies models than the facies models generated from the conventionally trained generator (Fig. 19, last plot in Fig. 10, and last plot in Fig. 14). In addition, the progressively trained generator is trained in only 39.2% (6.03 h) of the time required to train the conventionally trained generator (15.4 h). We also evaluated the facies proportion distributions of both the conventionally trained generator and the progressively trained generator, by comparing to the facies proportion distributions of the training data (Fig. 20). The cumulative distribution functions (cdfs) of both the trained generators are very close to the cdfs of the training data, over all facies types. We calculated the facies proportion errors in cdf for the two trained generators, over different facies types ( Table 2). The progressively trained generator has slightly lower errors than the conventionally trained generator, over all facies types.
We calculated the variograms of the facies models generated from the conventionally trained generator, the progressively trained generator, and the training facies models, along different directions (Fig. 21). We further compared the average variograms of the generators with that of the training facies models (Fig. 22). The average variograms of both generators are very close to that of the training data, over all directions. The average variograms of the two generators almost overlap; this means that the two generators have similar performance (and flaws) in capturing two-point spatial continuity, to some extent.
We calculated the cdf distributions of channel sinuosity, channel width, and channel length for the randomly sampled training facies models, facies models produced from conventionally trained generator, and facies models from progressively trained generator (Fig. 23). The cdfs of both trained generators are almost equally close to the cdf of the training data, for channel sinuosity, channel width, and channel length. Channel length reveals the connectivity of channels, thus the close channel length distributions in Fig. 23 indicates that, both trained generators produce similar channel connectivity as in training facies models. Looking at the details of the channel length distribution we see that the conventionally trained generator (blue triangles) produces models with a slightly larger proportion of shorter channel lengths as compared to the distribution of channel length from the training set (red circles). This implies that the channel connectivity is slightly poorer for the models from the conventionally trained generator. The models from the progressively trained generator (black squares) show a better match to the training set distribution.
To sum up, the progressively trained generator and the conventionally trained generator both have very good performances on all metrics. Once they are trained, it takes 0.16 milliseconds for each generator to simulate a 64×64-resolution facies model. The progressively trained generator behaves especially better than the conventionally trained generator on the MS-SWD metric, MDS plots, and the reduced training time.
We further compare the facies models generated from the progressively trained generator with their closest counterparts in the training dataset (Fig. 24). The closest training facies models are found based on the pixel-to-pixel Euclidean distance. We can see from Fig. 24 that, although the generated facies models are very similar to their counterparts in the training dataset, there are still large differences between therm. Thus, we can conclude that, the generated facies models are not just a naïve copy of the training dataset; instead, they are produced based on the geological pattern knowledge learned from the training dataset.

GANs used for karst cave facies modeling
Similar to the previous case, we trained two generators with conventional and progressive processes, respectively, using the karst cave training dataset. Based on visual inspection and change of MS-SWD metrics, the conventionally trained generator is obtained after 190 k iterations of alternative training. The training time is 4.7 h using 2 GPUs (NVIDIA Tesla V100-PCIE-32GB), 10 CPUs, and 80G RAM in parallel. Similarly, the progressively trained generator is obtained after Fig. 22 The comparison of the average variograms for the conventionally trained generator, the progressively trained generator, and the training facies models, along different directions (i.e., 0, 45, 90, and 135 degrees). The average variograms of two generators almost overlap, and they are both very close to that of the training facies models Fig. 23 Channel sinuosity (left), channel width (middle), and channel length (right) distributions (cdf's) of the randomly sampled training facies models, facies models produced from conventionally trained generator, and facies models from progressively trained generator 240 k iterations of alternative training, and takes 4.4 h with the same computational resources as the conventional process.
We manually compared the karst cave facies models produced by the two trained generators with the training karst cave facies models in realism and diversity (Fig. 25). All the generated facies models are very realistic and diversified, and almost indistinguishable from the training facies models. The average SWD of the progressively trained generator, conventionally trained generator, and the baseline computed from two groups of training facies models are 6.93, 11.75, and 4.69, respectively. That the average SWD of progressively trained generator is much closer to the baseline than the conventionally trained generator, indicates the superiority of the progressive training to some extent. Figure 26 shows the MDS plot to compare the distributions of the training facies models with the distributions of the facies models produced from the two generators, in 2D space. The facies models generated from the progressively trained generator is closer to the training facies models than those generated from the conventionally trained generator. We further compared the cdf distributions of the karst cave proportion for randomly sampled training facies models and facies models produced from the two trained generators (Fig. 27). The cdf of the progressively trained generator is slightly closer to the cdf of the training data than the conventionally trained generator.
In this case, compared to the conventional training process, the progressive training process only saves about 6% of the training time, while the saved training time in previous case is about 60%. The reason might be that, the karst cave distribution patterns to be learned in this case is easier than the sinuous channel distribution patterns in the previous case, given that generally only one or two karst caves exist in one facies model and that the caves are more aligned along the north-south direction (Fig. 8). With easier feature knowledge to be Fig. 24 Comparison between generated facies models with their closest counterparts in the training dataset Fig. 25 Comparison among the facies models generated from the conventionally trained generator (left), the facies models generated from the progressively trained generator (right), and the training facies models (middle). All the generated facies models are very similar to the training facies models in both realism and diversity learned, even the direct conventional training process can learn good features quite efficiently, thus the progressive training process cannot show much advantage in at least the training time.

Conclusions
Different from the conventional Generative Adversarial Network (GAN) training process, in the progressive GAN training process, or the progressive growing of GANs, GANs are trained layer by layer, allowing features to be learned by the generator from large coarse scales to fine scales. We trained GANs for geological facies modeling using both the progressive training workflow and the conventional training workflow, and assessed the two trained generators. We used visual inspection, multi-scale sliced Wasserstein distance (MS-SWD), multi-dimensional scaling (MDS) plots, facies proportion, variogram, channel sinuosity, channel width, and channel length as the evaluation metrics for the two trained generators. MS-SWD reveals the realism and variety of the generated facies models over different scales, compared to the training facies models; MS-SWD was also combined with MDS to visualize the distributions of the training and the generated facies models in 2D space. The two trained generators both have very good performances on all metrics. The progressively trained generator behaves especially better than the conventionally trained generator on the MS-SWD metric, MDS plots, and the GAN training time. The training time for the progressively trained generator can be only 39.2% of that for the conventionally trained generator. However, if the features to be learned are simple, then this advantage of the progressive process on training time might be smoothed. Both trained generators can reproduce very good facies proportion, variograms, and channel sinuosity, width, and length that are similar to the training data. Once the generators are trained, it takes 0.16 milliseconds for each generator to simulate a new 64×64-resolution facies model. We also built two large facies model training datasets for this study: one is sinuous channel reservoir type, and the other is karst cave reservoir type. These Fig. 26 The comparison of the training facies models, the facies models generated from the conventionally trained generator, and facies models generated from the progressively trained generator, in 2D space, using MDS plot. The distribution of facies models generated from the progressively trained generator is much closer to the training facies models than the facies models generated from the conventionally trained generator Fig. 27 Karst cave proportion distributions (cdf's) of the facies models produced from the conventionally trained generator, the progressively trained generator, and the training data datasets can also be used for future facies modeling-related works. Future work involves extending the progressive growing of GANs into conditional facies modeling and 3D applications.