Different videos deserve different token assignments: Compressing videos adaptively according to their content can improve both efficiency and quality.
Two tokenizers are compared: a traditional one that encodes videos into fixed-length token sequences, and ours, which encodes videos into adaptive-length token sequences. Our EVATok adopts more reasonable assignments that are consistent with the video content. For easier examples, EVATok achieves better reconstruction quality with fewer tokens; for challenging examples, it can use more tokens to obtain much better reconstruction quality.
Overall, EVATok achieves better reconstruction and downstream generation quality than previous methods while saving at least 24.4% of tokens. EVATok can be both faster and better because the structure of the token sequences is optimized.