GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

1The University of Hong Kong, 2ByteDance Seed
*Work partly done as an Intern at ByteDance. Corresponding author.

🔈News

[2025-04-14] The research paper, codebase and model checkpoints of GigaTok are released!

Scaling Tokenizers for
Reconstruction and Autoregressive Generation

🤔Reconstruction vs. generation dilemma: larger tokenizers can bring better reconstruction, but may lead to worse downstream AR generation.
🚀GigaTok breaks through this reconstruction vs. generation dilemma.

Method Comparison

Top (reconstruction): Naively scaling visual tokenizers improves reconstruction (red line). GigaTok achieves superior reconstruction quality after scaling (blue line and qualitative results).
Bottom (generation): Naively scaling visual tokenizers degrades downstream AR generation. In contrast, GigaTok improves both reconstruction and generation as tokenizers scale up.

qualitative samples

The 2.9B GigaTok achieves SOTA autoregressive image generation with a 1.4B AR model on ImageNet 256×256 resolution.

Abstract

In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality—a challenge not adequately addressed in existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma.

To mitigate this, we propose semantic regularization, which aligns tokenizer features with semantically consistent features from a pre-trained visual encoder. This constraint prevents excessive latent space complexity during scaling, yielding consistent improvements in both reconstruction and downstream autoregressive generation. Building on semantic regularization, we explore three key practices for scaling tokenizers: (1) using 1D tokenizers for better scalability, (2) prioritizing decoder scaling when expanding both encoder and decoder, and (3) employing entropy loss to stabilize training for billion-scale tokenizers. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.

Main Contributions

  • We identify that the reconstruction vs. generation dilemma in tokenizer scaling stems from the increased latent space complexity of larger tokenizers. To address this, we propose semantic regularization, which effectively mitigates the dilemma and enables tokenizer scaling.
  • We explore best practices for scaling tokenizers, including a 1D tokenizer with a hybrid CNN-Transformer architecture, asymmetric encoder-decoder scaling, and an entropy loss for billion-scale tokenizers.
  • GigaTok is the first tokenizer scaled to 3B parameters, achieving state-of-the-art reconstruction, downstream AR generation, and downstream AR representation quality on ImageNet.

Pilot Study on the
Reconstruction vs. Generation Dilemma

AR Probing

A complete evaluation of a tokenizer should cover not only reconstruction (rFID) but also downstream generation (gFID). To this end, we introduce "AR Probing": for every tokenizer we evaluate, we train a small 111M AR model on top of the tokenizer checkpoint and report the gFID as well as the cross-entropy validation loss of this AR model.
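For concreteness, below is a minimal sketch of the AR Probing protocol. The trainer and evaluator callables are hypothetical stand-ins for the actual training and evaluation pipelines, not the released API.

# Minimal sketch of AR Probing (hypothetical interface, not the released code).
# For each tokenizer checkpoint, train a small (~111M-parameter) AR model on its
# discrete tokens and report gFID plus the AR cross-entropy validation loss.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ProbeResult:
    tokenizer_ckpt: str
    val_loss: float   # AR cross-entropy on held-out tokens
    gfid: float       # generation FID of the small AR model

def ar_probe(
    tokenizer_ckpts: List[str],
    train_small_ar: Callable[[str], object],    # trains a ~111M AR model on the tokens
    eval_val_loss: Callable[[object], float],   # cross-entropy on validation tokens
    eval_gfid: Callable[[object], float],       # FID of samples from the AR model
) -> Dict[str, ProbeResult]:
    """Evaluate each tokenizer by the downstream behavior of a fixed small AR model."""
    results = {}
    for ckpt in tokenizer_ckpts:
        ar_model = train_small_ar(ckpt)         # identical recipe for every tokenizer
        results[ckpt] = ProbeResult(
            tokenizer_ckpt=ckpt,
            val_loss=eval_val_loss(ar_model),
            gfid=eval_gfid(ar_model),
        )
    return results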

Latent Space Complexity is Key to Tokenizer Scaling

Naive Scaling

Scaling trend for vanilla 1D tokenizers. As model size increases, the reconstruction quality of vanilla tokenizers improves, but the downstream AR Probing gFID consistently degrades. The increasing AR Probing validation loss indicates that scaling vanilla tokenizers yields a more complex latent space, which is harder for AR models to learn.

GigaTok

Architecture & Semantic Regularization

pipeline

GigaTok architecture and semantic regularization. Top: We use a hybrid CNN-Transformer design for our visual tokenizer; the transformer layers are implemented as ViT blocks for the 2D tokenizer and as a Q-Former for the 1D tokenizer. Bottom: We use a frozen DINOv2 image encoder for semantic regularization.
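As an illustration, here is a minimal sketch of how such a semantic regularization term could be implemented: intermediate tokenizer features are projected and aligned to frozen DINOv2 features via a cosine-similarity loss. The projection head, feature choice, pooling fallback, and loss form are our illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticRegularizer(nn.Module):
    """Align intermediate tokenizer features with frozen DINOv2 features (sketch)."""
    def __init__(self, tok_dim: int, dino_dim: int = 768):
        super().__init__()
        # Frozen pre-trained DINOv2 ViT-B/14 provides the semantic targets.
        self.dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
        for p in self.dino.parameters():
            p.requires_grad_(False)
        # Light projection from the tokenizer feature width to the DINOv2 width.
        self.proj = nn.Linear(tok_dim, dino_dim)

    def forward(self, z_tok: torch.Tensor, images: torch.Tensor) -> torch.Tensor:
        # z_tok: (B, N, tok_dim) intermediate tokenizer features.
        # images: (B, 3, H, W), assumed already resized/normalized for DINOv2.
        with torch.no_grad():
            z_sem = self.dino.forward_features(images)["x_norm_patchtokens"]  # (B, M, 768)
        z_pred = self.proj(z_tok)  # (B, N, dino_dim)
        # If token counts differ (e.g. a 1D tokenizer), compare pooled global features.
        if z_pred.shape[1] != z_sem.shape[1]:
            z_pred, z_sem = z_pred.mean(dim=1), z_sem.mean(dim=1)
        # Maximize cosine similarity between tokenizer features and semantic targets.
        return 1.0 - F.cosine_similarity(z_pred, z_sem, dim=-1).mean()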

Entropy Loss for Large Tokenizers

pipeline

Training curves for 2.9B XL-XXL tokenizers with and without entropy loss. A 2.9B tokenizer does not converge without entropy loss. The entropy loss encourages high codebook usage and stabilizes the training loss.
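Below is a minimal sketch of a codebook-usage entropy loss of the kind described here, assuming per-token logits over the K codebook entries; the exact formulation and weighting in the released code may differ.

import torch
import torch.nn.functional as F

def entropy_loss(logits: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Encourage confident per-token assignments but uniform codebook usage overall.
    logits: (B*N, K) scores of each token feature against the K codebook entries."""
    probs = F.softmax(logits, dim=-1)                       # (B*N, K)
    log_probs = torch.log(probs + eps)
    # Per-token entropy: low value -> each token commits to one code.
    per_sample_entropy = -(probs * log_probs).sum(dim=-1).mean()
    # Entropy of the batch-averaged code distribution: high value -> many codes used.
    avg_probs = probs.mean(dim=0)                           # (K,)
    codebook_entropy = -(avg_probs * torch.log(avg_probs + eps)).sum()
    # Minimizing this pushes per-token entropy down and codebook-usage entropy up.
    return per_sample_entropy - codebook_entropy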

Experiments

Semantic Regularization for Model Scaling

semantic regularization for scaling

Scaling trends of tokenizers for reconstruction, downstream generation, and representation quality with and without semantic regularization. With semantic regularization, GigaTok resolves the reconstruction vs. generation dilemma in tokenizer scaling, in contrast to the vanilla version without it. Moreover, GigaTok consistently improves the representation quality of downstream AR models as the visual tokenizer scales up. Note that in the last two figures, the red and blue curves correspond to different y-axis scales.

Visualization of Latent Space

semantic visualization by PCA

Visualization of tokenizer features with and without semantic regularization. We compute PCA over the tokenizer features of a group of images from the same "golden retriever" class and visualize the first 3 PCA components. The latent space of vanilla tokenizers shows inconsistent features both within a single image and across multiple semantically similar images. In contrast, GigaTok encodes images with semantic consistency and thus reduces the latent space complexity for AR models.
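A minimal sketch of this visualization is given below, assuming the features of several same-class images are stacked as (num_images, num_tokens, channels) with a square token grid; the helper name and normalization details are ours.

import numpy as np

def pca_rgb(features: np.ndarray, grid: int = 16) -> np.ndarray:
    """Project grouped token features onto their top-3 PCA components and map them
    to RGB, so semantically similar tokens get similar colors across images."""
    n_img, n_tok, c = features.shape
    assert n_tok == grid * grid, "sketch assumes a square token grid for display"
    flat = features.reshape(-1, c).astype(np.float64)
    flat -= flat.mean(axis=0, keepdims=True)
    # One shared PCA basis across the whole group of same-class images.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    comps = flat @ vt[:3].T                                  # (n_img * n_tok, 3)
    # Normalize each component to [0, 1] for display as RGB channels.
    comps = (comps - comps.min(axis=0)) / (comps.max(axis=0) - comps.min(axis=0) + 1e-6)
    return comps.reshape(n_img, grid, grid, 3)               # one RGB map per image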

Asymmetric Design for Better Performance

Encoder Decoder Scaling

Results for scaling the encoder and decoder. Prioritizing decoder scaling benefits downstream generation more than scaling the encoder (S-B vs. B-S), but scaling the encoder still brings significant improvements (S-L vs. B-L).

Scalability: 1D vs. 2D Tokenizers

1D and 2D Tokenizer

Left: 1D architecture of GigaTok with Q-Former. Right: 2D architecture with ViT blocks.

1D and 2D Tokenizer

Scalability comparison for 1D and 2D tokenizers. Under the same training setting, 1D tokenizers show better reconstruction (rFID) and downstream representation quality (AR Probing Lin. Acc.). For downstream generation (gFID), 1D tokenizers also improve more steeply than 2D tokenizers.
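To illustrate the difference between the two latent layouts, here is a simplified sketch: the 2D variant keeps one latent per spatial patch via ViT-style self-attention, while the 1D variant lets a fixed set of learnable queries cross-attend to the patch features, Q-Former style. Dimensions are illustrative, and the real Q-Former blocks also contain self-attention and feed-forward layers omitted here.

import torch
import torch.nn as nn

class Latent2D(nn.Module):
    """2D tokenizer latent: ViT-style self-attention keeps one latent per spatial patch."""
    def __init__(self, dim: int = 512, depth: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # (B, H*W, C) -> (B, H*W, C)
        return self.blocks(x)

class Latent1D(nn.Module):
    """1D tokenizer latent: learnable queries cross-attend to patches (Q-Former style),
    producing a shorter latent sequence decoupled from the 2D grid."""
    def __init__(self, dim: int = 512, num_latents: int = 256, depth: int = 2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_latents, dim) * 0.02)
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads=8, batch_first=True) for _ in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # (B, H*W, C) -> (B, num_latents, C)
        q = self.queries.expand(x.size(0), -1, -1)
        for attn in self.blocks:
            out, _ = attn(q, x, x)                         # queries attend to patch features
            q = q + out                                    # residual update of the queries
        return q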

Main Results

Quantitative Comparison

Comparison of tokenizers and downstream generation models on ImageNet 256×256. For gFID, we report the lower value between the settings with and without CFG. Table notes: training set includes data beyond ImageNet; frozen DINO used as the discriminator, which largely improves rFID; without classifier-free guidance.

BibTeX

            
@article{gigatok,
    title={GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation},
    author={Tianwei Xiong and Jun Hao Liew and Zilong Huang and Jiashi Feng and Xihui Liu},
    journal={arXiv preprint arXiv:2504.08736},
    year={2025}
}
            
          

Acknowledgment

The authors sincerely thank Qihang Yu and Liang-Chieh Chen for their valuable discussions during the development of GigaTok.