Vol. 4 No. 1 (2025)

Enhancing Vision Transformers for Image Generation: A Hierarchical GAN with Triplet Attention and Consistency Regularization

Published 2025-01-30

How to Cite

Whitman, J. (2025). Enhancing Vision Transformers for Image Generation: A Hierarchical GAN with Triplet Attention and Consistency Regularization. Journal of Computer Technology and Software, 4(1). https://doi.org/10.5281/zenodo.14832466

Abstract

This paper proposes a novel hierarchical generative adversarial network based on Vision Transformers (ViT) for unconditional image generation. To address common challenges in GANs, such as structural inconsistency and unstable training, we introduce a Triplet Attention mechanism within the generator, improving the structural soundness of generated images without increasing the model’s parameter count. Additionally, a consistency regularization term is integrated into the loss function, improving training stability and robustness to noise while mitigating overfitting. The effectiveness of the proposed method is demonstrated through extensive experiments on the CIFAR-10 and STL-10 datasets, where our framework outperforms TransGAN and other CNN-based GANs on both FID and IS metrics. Despite the simplicity of our architecture, which contains only three transformer layers, we achieve promising results, laying the groundwork for further enhancements in high-resolution image generation.
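The consistency regularization term mentioned in the abstract typically penalizes the discriminator for scoring an image and an augmented copy of it differently. The sketch below illustrates that idea in minimal, dependency-free Python; the function names, the plain-list representation of discriminator logits, and the `lambda_cr` weight are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of GAN consistency regularization: the discriminator D
# should produce similar outputs for an image x and an augmented copy T(x).
# Logits are modeled as plain lists of floats for simplicity; in practice
# these would be tensors produced by a ViT-based discriminator.

def consistency_loss(d_real, d_aug):
    """Mean squared difference between D's logits on originals vs. augmented copies."""
    assert len(d_real) == len(d_aug)
    return sum((a - b) ** 2 for a, b in zip(d_real, d_aug)) / len(d_real)

def discriminator_loss(adv_loss, d_real, d_aug, lambda_cr=10.0):
    """Total discriminator loss: adversarial term plus weighted consistency term.

    lambda_cr is a hypothetical weighting hyperparameter, not a value from the paper.
    """
    return adv_loss + lambda_cr * consistency_loss(d_real, d_aug)

# Tiny usage example with made-up logits:
d_x  = [0.9, -0.2, 0.5]   # D outputs on a small batch of real images
d_tx = [0.8, -0.1, 0.4]   # D outputs on augmented copies of the same batch
print(consistency_loss(d_x, d_tx))  # small when D is consistent under augmentation
```

The regularizer goes to zero exactly when the discriminator is invariant to the augmentation, which is what discourages it from latching onto augmentation-sensitive artifacts and is one plausible source of the training-stability gains the abstract describes.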