To use the collected data, we had to transform it into a form suitable for neural networks. This meant that the letters representing amino acids had to be translated into a numeric representation. Each amino acid was therefore encoded as an embedding vector of 8 physicochemical attributes (Kawashima, Ogata and Kanehisa, 1999). We opted for an embedding vector because it is more memory-efficient than one-hot encoding. With one-hot encoding, each amino acid would become a vector of length 20 containing 19 zeros and a single 1, whereas the embedding vector is 2.5 times smaller (see Figure 4). What is more, the neural network also receives experimentally gathered properties of each individual amino acid.
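The difference between the two encodings can be sketched in a few lines of plain Python. This is only an illustration: the property table below is filled with placeholder numbers, not the real physicochemical attributes, and the names `PROPERTY_TABLE`, `one_hot_encode` and `embed` as well as the example fragment are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
NUM_PROPERTIES = 8                    # embedding size stated in the text

# PLACEHOLDER values only: a real table would hold 8 measured
# physicochemical properties per amino acid (e.g. taken from AAindex).
_rng = random.Random(0)
PROPERTY_TABLE = {
    aa: [round(_rng.uniform(-1.0, 1.0), 3) for _ in range(NUM_PROPERTIES)]
    for aa in AMINO_ACIDS
}


def one_hot_encode(sequence):
    """Return one 20-element vector per residue: 19 zeros and a single 1."""
    return [[1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]
            for residue in sequence]


def embed(sequence):
    """Return one 8-element property vector per residue."""
    return [list(PROPERTY_TABLE[residue]) for residue in sequence]


sequence = "MKTAY"  # hypothetical protein fragment
one_hot = one_hot_encode(sequence)
embedded = embed(sequence)

# Per-residue vector lengths and their ratio (20 / 8 = 2.5).
print(len(one_hot[0]), len(embedded[0]), len(one_hot[0]) / len(embedded[0]))
```

Both encodings produce one vector per residue; only the width of each vector differs, which is where the memory saving comes from.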
Training process of ProteinGAN
The generator receives a set of numbers drawn from a random Gaussian distribution, called noise, which is passed through a pre-defined number of filters that contain trainable parameters. These filters are arranged in blocks called residual blocks. After each block the length of the input increases until it matches the length of a real protein (analogous to increasing the resolution in image generation) (Figure 2). The last layer ensures that the range of values and the shape match the original data.
Figure 5. Visualization of protein generation
The discriminator receives generated and real sequences and passes them through a network containing filters of the same size, stacked in reverse order compared to the generator (so that neither network overpowers the other). At the end, the discriminator arrives at a single number for each sequence, which corresponds to its confidence that the particular sequence resembles natural protein sequences.
Using the scores from the discriminator, each part of the GAN is evaluated using a loss function.
Hinge loss, where D is the discriminator function that returns a single number for each sequence, G is the generator function that takes noise as input and returns a sequence of amino acids, and N is the number of examples processed in one step.
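Written out in the standard hinge form for GANs (assumed here, not confirmed by the source; x_i denotes a real sequence and z_i a noise vector, both symbols introduced only for illustration), the two losses would be:

```latex
\begin{aligned}
L_D &= \frac{1}{N}\sum_{i=1}^{N} \max\bigl(0,\; 1 - D(x_i)\bigr)
     + \frac{1}{N}\sum_{i=1}^{N} \max\bigl(0,\; 1 + D(G(z_i))\bigr) \\
L_G &= -\frac{1}{N}\sum_{i=1}^{N} D(G(z_i))
\end{aligned}
```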
Intuitively, this means that the discriminator does well if it assigns low scores to generated examples and high scores to real ones. The generator, on the other hand, is penalized if generated examples are scored low. These scores determine how the parameters are adjusted to improve the discriminator and the generator separately.
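A minimal sketch of how such losses could be computed from discriminator scores, assuming the standard hinge formulation for GANs (the exact variant used in ProteinGAN is not shown here). The function names and the example scores are hypothetical.

```python
def discriminator_hinge_loss(real_scores, fake_scores):
    """Low when real scores are >= 1 and generated scores are <= -1."""
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_hinge_loss(fake_scores):
    """Low when the discriminator scores generated examples highly."""
    return -sum(fake_scores) / len(fake_scores)


# Hypothetical discriminator outputs for a batch of N = 2 sequences.
real_scores = [2.0, 0.5]
fake_scores = [-2.0, 0.0]

print(discriminator_hinge_loss(real_scores, fake_scores))  # 0.75
print(generator_hinge_loss(fake_scores))                   # 1.0
```

Note that the discriminator stops being rewarded once a score passes the margin of 1 (or -1 for generated examples), which keeps it from pushing scores arbitrarily far apart.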
As the generator and discriminator have opposite goals, they compete against each other in a so-called minimax game. The forward pass and backpropagation are repeated until the generated sequences stop improving or are hardly distinguishable from real ones.
To train ProteinGAN we used a Google Cloud instance with an NVIDIA Tesla P100 GPU (16 GB). Training took approximately 60 h for each class, each of which contained at least 20,000 unique proteins. The entire network contained ~4M trainable parameters. For optimization, the Adam optimizer was chosen with a learning rate of 0.0001 for both the discriminator and the generator (β1 = 0, β2 = 0.9). The discriminator and the generator were trained for the same number of steps.
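To illustrate what these optimizer settings imply, here is a minimal sketch of a single Adam update in plain Python. The function name `adam_step`, the epsilon value and the example parameters are assumptions for illustration; only the learning rate and the two beta values come from the text.

```python
import math

LEARNING_RATE = 1e-4  # from the text
BETA1 = 0.0           # from the text
BETA2 = 0.9           # from the text
EPSILON = 1e-8        # assumed; not stated in the text


def adam_step(params, grads, m, v, step):
    """Apply one Adam update to a list of scalars; `step` counts from 1."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = BETA1 * m_i + (1.0 - BETA1) * g      # first-moment estimate
        v_i = BETA2 * v_i + (1.0 - BETA2) * g * g  # second-moment estimate
        m_hat = m_i / (1.0 - BETA1 ** step)        # bias correction
        v_hat = v_i / (1.0 - BETA2 ** step)
        new_params.append(
            p - LEARNING_RATE * m_hat / (math.sqrt(v_hat) + EPSILON))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# Hypothetical parameters and gradients.
params, grads = [1.0, -2.0], [0.5, -0.25]
params, m, v = adam_step(params, grads, [0.0, 0.0], [0.0, 0.0], step=1)
print(params, m)
```

With β1 = 0 the first-moment estimate is just the current gradient, so no momentum is carried over between steps; only the gradient magnitude is smoothed through β2.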
ProteinGAN Architecture details
ProteinGAN networks are composed of residual blocks, which showed the best performance in ImageNet challenges. They are also widely adopted in various implementations of GANs. Each block in the discriminator contains 3 convolution layers with a filter size of 3x3. The generator residual blocks consist of two deconvolution (transposed convolution) layers and one convolution block with the same filter size of 3x3 (see Figure 6).
Figure 6. Residual Blocks
We experimented with 3 up-sampling techniques: nearest neighbor, transposed convolutions (deconvolutions) and sub-pixel convolutions (Shi et al., 2016). After numerous experiments, we chose transposed convolutions as the technique best suited to proteins.
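To make the difference between two of these techniques concrete, the sketch below up-samples a single-channel 1-D signal in plain Python. It is a simplification: real layers operate on many channels, and the function names, the stride of 2 and the kernels are assumptions for illustration.

```python
def nearest_neighbor_upsample(signal, factor=2):
    """Repeat every value `factor` times; nothing is learned."""
    return [value for value in signal for _ in range(factor)]


def transposed_conv1d(signal, kernel, stride=2):
    """Scatter each input value, weighted by the kernel, into the output.

    Output length is (len(signal) - 1) * stride + len(kernel).
    In a real layer the kernel weights are trainable.
    """
    out_len = (len(signal) - 1) * stride + len(kernel)
    output = [0.0] * out_len
    for i, value in enumerate(signal):
        for j, weight in enumerate(kernel):
            output[i * stride + j] += value * weight
    return output


signal = [1.0, 2.0]
print(nearest_neighbor_upsample(signal))              # [1.0, 1.0, 2.0, 2.0]
print(transposed_conv1d(signal, [1.0, 1.0]))          # [1.0, 1.0, 2.0, 2.0]
print(transposed_conv1d(signal, [1.0, 0.5, 0.25]))    # [1.0, 0.5, 2.25, 1.0, 0.5]
```

With a kernel of ones whose length equals the stride, the transposed convolution reproduces nearest-neighbor up-sampling; other kernels let the network learn how to fill in the new positions.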
Figure 7: Architecture of generator and discriminator
The addition of Dilation into ProteinGAN
Convolution filters are very good at detecting local features, but they have limitations when it comes to long-distance relationships. As a result, many deep learning algorithms use an RNN (Recurrent Neural Network) approach for sequences. However, it has been shown that convolution filters with dilation can outperform RNNs (Bai, Kolter and Koltun, 2018). The idea of dilation is to increase the receptive field without increasing the number of parameters by introducing gaps into the convolution kernels (Figure 8). Dilation was applied to one convolution filter in each residual block. The dilation rate was increased by 2 in each consecutive block. In this way, by the last layer of the network, the filters had a large enough receptive field to learn long-distance relationships.
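The effect of dilation on the receptive field can be sketched in plain Python. The example uses a single-channel 1-D signal and the dilation rates 1, 2 and 4 from Figure 8; the function names and the input values are assumptions for illustration, and the schedule is not necessarily the one used in ProteinGAN.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D convolution with gaps of `dilation` between kernel taps."""
    span = (len(kernel) - 1) * dilation + 1  # input positions one output sees
    return [
        sum(w * signal[start + k * dilation] for k, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of stride-1 convolutions."""
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field


signal = [1, 2, 3, 4, 5]
kernel = [1, 1, 1]  # still only 3 parameters, whatever the dilation
print(dilated_conv1d(signal, kernel, dilation=1))  # [6, 9, 12]
print(dilated_conv1d(signal, kernel, dilation=2))  # [9]

print(receptive_field(3, [1, 1, 1]))  # 7
print(receptive_field(3, [1, 2, 4]))  # 15
```

Three stacked layers with the same number of parameters see 15 positions instead of 7 once dilation is introduced.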
Figure 8: (a) dilation rate equals 1 (standard convolution), (b) dilation rate equals 2, (c) dilation rate equals 4. Red dots show where the 3x3 filter is applied; colored squares show the receptive field given the previous convolutions. Source: (Yu and Koltun, 2016)
Self-Attention
Different areas of a protein have different responsibilities in the overall behaviour of the protein. For the network to capture this, a self-attention mechanism (Zhang et al., 2018) was implemented. Put simply, it is a number of layers that highlight different areas of importance across the entire sequence. There are 64 such filters in the ProteinGAN and ReactionGAN implementations.
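A heavily simplified sketch of the idea in plain Python: every position in the sequence is compared with every other position, and the resulting weights decide how much each position contributes to the output. The learned query, key and value projections of Zhang et al. (2018) are replaced here by identity mappings, and `gamma` is a plain argument rather than a trained parameter; both are simplifying assumptions, as are the function names and the input features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(features, gamma=1.0):
    """features: one feature vector per sequence position.

    Returns vectors of the same shape: input + gamma * attended values.
    """
    output = []
    for query in features:
        # How strongly this position attends to every position (sums to 1).
        weights = softmax([dot(query, key) for key in features])
        attended = [
            sum(w * value[c] for w, value in zip(weights, features))
            for c in range(len(query))
        ]
        output.append([x + gamma * a for x, a in zip(query, attended)])
    return output


# Hypothetical features for a sequence of 3 positions, 2 channels each.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(self_attention(features, gamma=0.0))  # unchanged input
print(self_attention(features, gamma=1.0))
```

Because the attention weights span the whole sequence, a position can draw on information from distant positions in a single step, which complements the local view of convolution filters.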
Figure 9: Source: (Zhang et al., 2018)
Spectral normalization
One of the biggest issues when implementing a GAN is the stability of training. In practice, many GAN implementations suffer from vanishing (very small) or exploding (enormously large) gradients. In both cases the GAN is not able to learn patterns in the data successfully. To mitigate this issue, the spectral normalization technique (Miyato et al., 2018) was used. It is a regularization method that constrains the Lipschitz constant of the network by normalizing the weights of each layer by their spectral norm. It has been shown to work successfully even for large and complex datasets such as ImageNet (Brock, Donahue and Simonyan, 2018).
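The core of the technique is to divide a layer's weight matrix by its largest singular value, which can be estimated by power iteration. The sketch below does this in plain Python for a small matrix. It is illustrative only: the function names are hypothetical, the start vector is fixed rather than random, and many iterations are run at once for clarity, whereas Miyato et al. (2018) reuse the vector across training steps and perform a single iteration per step.

```python
import math


def _normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def spectral_norm(weights, iterations=20):
    """Estimate the largest singular value of a matrix by power iteration."""
    rows, cols = len(weights), len(weights[0])
    u = _normalize([1.0] * rows)  # fixed start vector (usually random)
    for _ in range(iterations):
        # v <- W^T u, then u <- W v, renormalizing each time.
        v = _normalize([sum(weights[r][c] * u[r] for r in range(rows))
                        for c in range(cols)])
        u = _normalize([sum(weights[r][c] * v[c] for c in range(cols))
                        for r in range(rows)])
    # sigma = u^T W v
    return sum(u[r] * weights[r][c] * v[c]
               for r in range(rows) for c in range(cols))


def spectrally_normalize(weights, iterations=20):
    """Rescale the matrix so that its largest singular value is about 1."""
    sigma = spectral_norm(weights, iterations)
    return [[w / sigma for w in row] for row in weights]


weights = [[3.0, 0.0], [0.0, 1.0]]  # hypothetical layer weights
print(spectral_norm(weights))
print(spectrally_normalize(weights))
```

After normalization the layer can stretch any input by a factor of at most about 1, which bounds how quickly the output can change and keeps gradients in a workable range.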
Mode collapse
Given the original GAN formulation, nothing prevents the generator from producing a single, very realistic example to fool the discriminator. Such a scenario is known as mode collapse. It happens when the generator learns to ignore its input (the random numbers). Logically, it is an efficient way for the generator to start producing examples that could fool the discriminator. However, it is undesirable behaviour, and it eventually cripples training because the discriminator can easily memorize the generated examples. While working with proteins, we observed that this issue is even more severe than with images. The scientific community has proposed many approaches to address mode collapse: Unrolled GAN (Metz et al., 2017), Dual Discriminator (Nguyen et al., 2017) and Mini Batch Discriminator (Salimans et al., 2016), to name a few. We preferred the Mini Batch Discriminator approach due to its simplicity and minimal overhead. Mini Batch Discriminator works as an extra layer in the network that computes the standard deviation across the batch of examples (a batch contains only real or only fake sequences). If the batch contains a small variety of examples, the standard deviation will be low and the discriminator will be able to use this information to lower the final score for each example in the batch. ProteinGAN and ReactionGAN follow the approach proposed by the authors of Progressive Growing of GANs (Karras et al., 2018).
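A minimal sketch of such a layer in plain Python, loosely following the minibatch standard deviation idea of Karras et al. (2018): the standard deviation of every feature is computed across the batch, averaged into one number, and appended to each example as an extra feature. The function name and the flat feature layout are assumptions for illustration.

```python
import statistics


def minibatch_stddev_feature(batch):
    """Append the batch-wide average standard deviation to every example.

    batch: list of examples, each a flat list of features.
    """
    num_features = len(batch[0])
    per_feature_std = [
        statistics.pstdev([example[f] for example in batch])
        for f in range(num_features)
    ]
    diversity = sum(per_feature_std) / num_features
    return [list(example) + [diversity] for example in batch]


# A collapsed batch (identical examples) versus a varied one.
collapsed = [[1.0, 2.0], [1.0, 2.0]]
varied = [[0.0, 0.0], [2.0, 2.0]]

print(minibatch_stddev_feature(collapsed))  # extra feature is 0.0
print(minibatch_stddev_feature(varied))     # extra feature is 1.0
```

The later layers of the discriminator can then learn that an extra feature close to zero signals a batch with little variety, which is typical of a collapsed generator.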
References
Bai, S., Kolter, J. and Koltun, V. (2018). An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. [online] Arxiv.org. Available at: https://arxiv.org/abs/1803.01271 [Accessed 13 Oct. 2018].
Brock, A., Donahue, J. and Simonyan, K. (2018). Large Scale GAN Training for High Fidelity Natural Image Synthesis. [online] Arxiv.org. Available at: https://arxiv.org/abs/1809.11096 [Accessed 14 Oct. 2018].
Karras, T., Aila, T., Laine, S. and Lehtinen, J. (2018). Progressive Growing of GANs for Improved Quality, Stability, and Variation. [online] Arxiv.org. Available at: https://arxiv.org/abs/1710.10196 [Accessed 14 Oct. 2018].
Kawashima, S., Ogata, H., & Kanehisa, M. (1999). AAindex: Amino acid index database. Nucleic Acids Research, 27(1), 368–369. http://doi.org/10.1093/nar/27.1.368
Metz, L., Poole, B., Pfau, D. and Sohl-Dickstein, J. (2017). Unrolled Generative Adversarial Networks. [online] Arxiv.org. Available at: https://arxiv.org/abs/1611.02163 [Accessed 14 Oct. 2018].
Miyato, T., Kataoka, T., Koyama, M. and Yoshida, Y. (2018). Spectral Normalization for Generative Adversarial Networks. [online] Arxiv.org. Available at: https://arxiv.org/abs/1802.05957 [Accessed 14 Oct. 2018].
Nguyen, T., Le, T., Vu, H. and Phung, D. (2017). Dual Discriminator Generative Adversarial Nets. [online] Arxiv.org. Available at: https://arxiv.org/abs/1709.03831 [Accessed 14 Oct. 2018].
Robert, X., & Gouet, P. (2014). Deciphering key features in protein structures with the new ENDscript server. Nucleic Acids Research, 42(W1), 320–324. http://doi.org/10.1093/nar/gku316
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A. and Chen, X. (2016). Improved Techniques for Training GANs. [online] Arxiv.org. Available at: https://arxiv.org/abs/1606.03498 [Accessed 14 Oct. 2018].
Yu, F. and Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. [online] Arxiv.org. Available at: https://arxiv.org/abs/1511.07122 [Accessed 14 Oct. 2018].
Zhang, H., Goodfellow, I., Metaxas, D. and Odena, A. (2018). Self-Attention Generative Adversarial Networks. [online] Arxiv.org. Available at: https://arxiv.org/abs/1805.08318 [Accessed 14 Oct. 2018].