EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

1The University of Hong Kong, 2ByteDance Seed
*Work partly done as an intern at ByteDance Seed. Corresponding author.
CVPR 2026

🔈News

[2026-03-13] The research paper, codebase and model checkpoints of EVATok are released!

[2026-02-21] EVATok is accepted by CVPR 2026!

Adaptive Length Video Tokenization
Improves Both Efficiency and Quality

Different videos deserve different token assignments: compressing videos adaptively according to their content improves both efficiency and quality.

Method Comparison

Two tokenizers are compared. One (traditional) encodes videos into fixed-length token sequences, while the other (ours) encodes videos into adaptive-length token sequences. Our EVATok makes assignments that are coherent with the video content: for easier examples, EVATok achieves better reconstruction quality with fewer tokens; for challenging examples, it can spend more tokens to obtain much better reconstruction quality.


Method Comparison

Overall, EVATok achieves better reconstruction and downstream generation quality compared to previous methods, while saving at least 24.4% of the tokens. EVATok can be both faster and better because the structure of the token sequences is optimized.

Abstract

Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce EVATok, a framework to produce Efficient Video Adaptive Tokenizers. Our framework estimates optimal token assignments for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.

Main Contributions

  • A four-stage framework for efficient video adaptive tokenization, featuring a router that provides optimal budget assignments during training and inference of tokenizers.
  • Proxy reward: a novel metric that uses a variable-length tokenizer to identify the optimal assignment for each video.
  • Extensive experiments showing that content-adaptive video tokenization can surpass fixed-length baselines, achieving superior performance in reconstruction and downstream AR generation with fewer tokens.

EVATok

Four-Stage Framework

pipeline

Four-stage framework for adaptive video tokenizer training. Stage 1 trains a proxy tokenizer to reconstruct videos under all candidate assignments. Stage 2 applies the proxy tokenizer to compute proxy rewards for all candidate assignments across videos from a dataset. It identifies the assignments with maximum proxy rewards to curate a classification dataset of videos and their optimal assignments. Stage 3 trains a router on the curated dataset to predict the optimal assignments for videos. Stage 4 trains the final tokenizer from scratch, with the router determining the assignment for each input video during training.
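Stage 2 amounts to an argmax over candidate assignments under a quality-cost trade-off. The sketch below illustrates the idea in plain Python; the function names, the reward form `quality - lam * cost`, and all numbers are illustrative assumptions, not the paper's exact proxy-reward metric.

```python
def proxy_reward(quality: float, num_tokens: int, lam: float = 0.001) -> float:
    """Toy proxy reward trading reconstruction quality against token cost.
    The linear penalty and lam value are assumptions for illustration."""
    return quality - lam * num_tokens

def best_assignment(candidates):
    """Pick the candidate assignment with maximum proxy reward.

    `candidates` maps an assignment (tuple of per-block token budgets)
    to the reconstruction-quality score the proxy tokenizer achieves
    for this video under that assignment.
    """
    return max(candidates, key=lambda a: proxy_reward(candidates[a], sum(a)))

# A static video loses little quality under the cheaper assignment,
# so the cheaper assignment wins the proxy reward.
static_video = {(64, 64, 64, 64): 0.90, (32, 32, 32, 32): 0.89}
print(best_assignment(static_video))   # (32, 32, 32, 32)

# A dynamic video degrades badly when compressed, so the larger
# budget wins despite its higher token cost.
dynamic_video = {(64, 64, 64, 64): 0.95, (32, 32, 32, 32): 0.50}
print(best_assignment(dynamic_video))  # (64, 64, 64, 64)
```

Pairs of videos and their winning assignments, collected over a dataset this way, form the classification dataset used to train the router in Stage 3.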

Adaptive Length Video Tokenizer Architecture

adaptive length video tokenizer architecture

Architecture of 1D variable-length video tokenizer for EVATok. The input video is spatio-temporally patchified into 3D embeddings. According to a given assignment a, 1D variable-length query embeddings are initialized from these 3D embeddings. After Q-Former encoding and quantization, 1D discrete tokens are produced. Finally, 3D queries are initialized to reconstruct the video frames from the 1D tokens.
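At the sequence level, an assignment a is just a list of per-block token budgets that are concatenated into one 1D sequence. A minimal shape-level sketch (all names and numbers are hypothetical; the Q-Former encoding and quantization steps are omitted):

```python
def token_layout(assignment):
    """Given per-block token budgets, return the (start, end) span each
    temporal block occupies in the concatenated 1D token sequence."""
    spans, start = [], 0
    for budget in assignment:
        spans.append((start, start + budget))
        start += budget
    return spans

# A 4-block video where the router requests more tokens for the two
# dynamic middle blocks and fewer for the static first and last blocks.
assignment = (32, 96, 96, 32)
print(token_layout(assignment))  # [(0, 32), (32, 128), (128, 224), (224, 256)]
print(sum(assignment))           # 256 tokens, versus 384 for a fixed 4 x 96
```

The downstream AR model then generates this variable-length sequence block by block, so per-block savings translate directly into fewer decoding steps.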

Experiments

Quality-Cost Trade-off Curves

trade off curve

Quality-cost trade-off curves for different assignment strategies. By adaptively assigning token budgets to different temporal blocks across various videos, our max-proxy-reward strategy (green series) achieves superior performance under various overall budgets compared to the typical fixed uniform token assignment approach (red series). The router-based assignment (blue series) delivers performance close to that of the max-proxy-reward strategy on both WebVid and UCF datasets (the latter unseen during router training).

Video Semantic Encoder is Important for Video Tokenizer

semantic encoder ablation

Ablation study for video representation alignment and video semantic discriminator. Removing either design will lead to degradation in rFVD and downstream gFVD. Representation alignment guidance and discriminative feedback from the semantic encoder are both crucial for video tokenizer training.

Main Results

system level comparison

Visualization of tokenizer features with and without semantic regularization. We compute PCA among the tokenizer features of a group of images of the same "golden retriever" class and visualize the first 3 PCA components. We observe that the latent space of vanilla tokenizers shows inconsistent features both within a single image and across multiple semantically similar images. In contrast, GigaTok encodes images with semantic consistency and thus reduces the latent space complexity for AR models.

Video Examples

Below we show reconstruction comparisons (Original vs EVATok), frame prediction results (Condition vs Prediction) and class-to-video generation samples.

Reconstruction

Frame Prediction

Class-to-Video Generation

BibTeX

@article{xiong2025evatok,
  title={EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation},
  author={Xiong, Tianwei and Liew, Jun Hao and Huang, Zilong and Lin, Zhijie and Feng, Jiashi and Liu, Xihui},
  journal={arXiv preprint arXiv:2603.12267},
  year={2026}
}