LVD-2M:
A Long-take Video Dataset with Temporally Dense Captions

NeurIPS D&B 2024
Tianwei Xiong1,*, Yuqing Wang1,*, Daquan Zhou2,†, Zhijie Lin2
Jiashi Feng2, Xihui Liu1
1The University of Hong Kong, 2ByteDance
*Equal contribution. †Project lead. Corresponding author.

Demo Video

Abstract

The efficacy of video generation models heavily depends on the quality of their training datasets. Most previous video generation models are trained on short video clips, while recently there has been increasing interest in training long video generation models directly on longer videos. However, the lack of such high-quality long videos impedes the advancement of long video generation. To promote research in long video generation, we aim to curate a new dataset with four key features essential for training long video generation models: (1) long videos covering at least 10 seconds, (2) long-take videos without cuts, (3) large motion and diverse content, and (4) temporally dense captions. To achieve this, we introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions. Specifically, we define a set of metrics to quantitatively assess video quality, including scene cuts, dynamic degrees, and semantic-level quality, enabling us to filter high-quality long-take videos from a large number of source videos. Subsequently, we develop a hierarchical video captioning pipeline to annotate long videos with temporally dense captions. With this pipeline, we curate the first long-take video dataset, LVD-2M, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions. We further validate the effectiveness of LVD-2M by fine-tuning video generation models to generate long videos with dynamic motions. We believe our work will significantly contribute to future research in long video generation.

Data Pipeline


Video filtering process. We employ multiple criteria to select high-quality, dynamic, and long-take videos from four source datasets.
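
As a rough illustration of how such filtering criteria can be computed, the sketch below detects scene cuts with PySceneDetect and scores motion with OpenCV optical flow. It is a minimal sketch rather than the exact LVD-2M filtering code: the thresholds are placeholders and the semantic-level quality filter is omitted.

# Minimal sketch of long-take and dynamic-degree filtering (not the exact LVD-2M implementation).
# Assumes PySceneDetect >= 0.6 and OpenCV; thresholds are illustrative placeholders.
import cv2
import numpy as np
from scenedetect import detect, ContentDetector


def is_long_take(video_path: str) -> bool:
    """Keep videos in which no scene cut is detected."""
    scene_list = detect(video_path, ContentDetector())
    return len(scene_list) <= 1  # zero or one detected scene means no cuts


def dynamic_score(video_path: str, sample_stride: int = 10) -> float:
    """Average optical-flow magnitude over sampled frame pairs, as a proxy for motion."""
    cap = cv2.VideoCapture(video_path)
    prev_gray, magnitudes, idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is not None:
                flow = cv2.calcOpticalFlowFarneback(
                    prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
                magnitudes.append(float(np.linalg.norm(flow, axis=-1).mean()))
            prev_gray = gray
        idx += 1
    cap.release()
    return float(np.mean(magnitudes)) if magnitudes else 0.0


def keep_video(video_path: str, min_flow: float = 1.0) -> bool:
    # Semantic-level quality filtering is omitted in this sketch.
    return is_long_take(video_path) and dynamic_score(video_path) >= min_flow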


Hierarchical video captioning process. First, we split the long video into 30-second clips and compose them into image grids. Then, we use the LLaVA-1.6-34B model to generate captions for each image grid. Finally, we use the Claude3-Haiku model to refine and merge these captions into the final complete caption for the whole video.
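
A minimal sketch of this hierarchy is shown below. Composing the frame grid with PIL is standard; `caption_grid_with_vlm` and `merge_captions_with_llm` are hypothetical wrappers standing in for the LLaVA-1.6-34B and Claude3-Haiku calls, whose prompts and APIs are not reproduced here.

# Sketch of the hierarchical captioning flow (hypothetical wrappers, not the exact pipeline).
from typing import Callable, List
from PIL import Image


def compose_image_grid(frames: List[Image.Image], cols: int = 3) -> Image.Image:
    """Tile sampled frames into a single grid image for the VLM."""
    rows = (len(frames) + cols - 1) // cols
    w, h = frames[0].size
    grid = Image.new("RGB", (cols * w, rows * h))
    for i, frame in enumerate(frames):
        grid.paste(frame, ((i % cols) * w, (i // cols) * h))
    return grid


def caption_long_video(
    clips: List[List[Image.Image]],                          # frames sampled from each ~30 s clip
    caption_grid_with_vlm: Callable[[Image.Image], str],     # e.g. a LLaVA-1.6-34B wrapper
    merge_captions_with_llm: Callable[[List[str]], str],     # e.g. a Claude3-Haiku wrapper
) -> str:
    # Step 1: caption each clip's image grid independently.
    clip_captions = [caption_grid_with_vlm(compose_image_grid(frames)) for frames in clips]
    # Step 2: refine and merge clip-level captions into one temporally dense caption.
    return merge_captions_with_llm(clip_captions)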

LVD-2M Dataset Statistics


Comparison with Other Datasets


We report five metrics: the long-take rate judged by human raters, the average caption length in words, the dynamic degree (the average of human-rated dynamic scores from 1 to 3), the median video clip length, and the average optical flow magnitude.
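
For the automatically computed entries (caption length, median clip length, and optical flow magnitude), a minimal aggregation sketch is given below; the per-clip metadata field names (`caption`, `duration`, `flow_magnitude`) are assumptions for illustration, not the released annotation schema.

# Illustrative aggregation of the automatic comparison statistics (field names are assumed).
from statistics import mean, median
from typing import Dict, List


def dataset_statistics(clips: List[Dict]) -> Dict[str, float]:
    """Aggregate the automatic metrics from per-clip metadata records."""
    return {
        "avg_caption_words": mean(len(c["caption"].split()) for c in clips),
        "median_clip_length_s": median(c["duration"] for c in clips),
        "avg_flow_magnitude": mean(c["flow_magnitude"] for c in clips),
    }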

LVD-2M for Finetuning Video Generation Models

Extending a Diffusion-based T2V Model for Longer Range on LVD-2M

Finetuning Dataset | Subject Consistency | Background Consistency | Temporal Flickering | Motion Smoothness | Dynamic Degree | Aesthetic Quality | Imaging Quality | Object Class
WebVid-10M | 95.81% | 98.02% | 98.00% | 97.87% | 20.00% | 58.02% | 72.63% | 76.95%
LVD-2M | 96.12% | 96.92% | 97.44% | 98.43% | 28.06% | 57.56% | 70.72% | 86.93%

Finetuning Dataset | Multiple Objects | Human Action | Color | Spatial Relationship | Scene | Appearance Style | Temporal Style | Overall Consistency
WebVid-10M | 26.02% | 61.40% | 75.51% | 51.06% | 29.19% | 20.12% | 19.34% | 21.43%
LVD-2M | 22.76% | 76.20% | 79.32% | 51.40% | 32.95% | 20.60% | 20.25% | 21.29%
VBench evaluation results for two T2V diffusion models finetuned separately on LVD-2M and WebVid-10M from the same base model. Finetuning on LVD-2M yields improvements of more than 8 percentage points over WebVid-10M on Dynamic Degree, Object Class, and Human Action.
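
For reference, the snippet below sketches how such scores can be reproduced with VBench's published Python interface; the device handling, paths, run name, and the selected dimensions are assumptions for illustration rather than the exact evaluation setup used here.

# Sketch of a VBench evaluation run (paths, name, and dimensions are illustrative assumptions).
import torch
from vbench import VBench

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
evaluator = VBench(device, "VBench_full_info.json", "vbench_results/")
evaluator.evaluate(
    videos_path="samples/lvd2m_finetuned/",  # videos sampled from the finetuned model
    name="lvd2m_finetuned",
    dimension_list=["dynamic_degree", "object_class", "human_action"],
)
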
After finetuning a T2V diffusion model on LVD-2M, the generated videos are more dynamic, and the actions and objects in them are more plausible than those produced by finetuning on WebVid-10M.

Finetuning a Diffusion I2V Model and an LM-based T2V Model

Human evaluation of videos generated by the baseline vs. the finetuned models. We finetune both a diffusion-based I2V model and an LM-based T2V model on LVD-2M. Compared to the pretrained models, the finetuned models generate more dynamic videos.

Gallery of Video-Text Pairs

BibTeX


@article{xiong2024lvd2m,
      title={LVD-2M: A Long-take Video Dataset with Temporally Dense Captions}, 
      author={Tianwei Xiong and Yuqing Wang and Daquan Zhou and Zhijie Lin and Jiashi Feng and Xihui Liu},
      year={2024},
      journal={arXiv preprint arXiv:2410.10816}
}