LVD-2M:
A Long-take Video Dataset with Temporally Dense Captions

NeurIPS D&B 2024

Tianwei Xiong¹,*, Yuqing Wang¹,*, Daquan Zhou²,†, Zhijie Lin²,
Jiashi Feng², Xihui Liu¹
¹The University of Hong Kong, ²ByteDance
*Equal contribution. †Project lead. Corresponding author.


🔈News

[2024.10.15] Our research paper, project page and LVD-2M dataset are released!

Demo Video

Abstract

The efficacy of video generation models heavily depends on the quality of their training datasets. Most previous video generation models are trained on short video clips, while recently there has been increasing interest in training long video generation models directly on longer videos. However, the lack of such high-quality long videos impedes the advancement of long video generation. To promote research in long video generation, we desire a new dataset with four key features essential for training long video generation models: (1) long videos covering at least 10 seconds, (2) long-take videos without cuts, (3) large motion and diverse contents, and (4) temporally dense captions. To achieve this, we introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions. Specifically, we define a set of metrics to quantitatively assess video quality including scene cuts, dynamic degrees, and semantic-level quality, enabling us to filter high-quality long-take videos from a large number of source videos. Subsequently, we develop a hierarchical video captioning pipeline to annotate long videos with temporally dense captions. With this pipeline, we curate the first long-take video dataset, LVD-2M, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions. We further validate the effectiveness of LVD-2M by fine-tuning video generation models to generate long videos with dynamic motions. We believe our work will significantly contribute to future research in long video generation.

Data Pipeline

Video filtering process. Our video filtering process employs multiple criteria to select high-quality, dynamic, and long-take videos from four source datasets.
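As a rough illustration of how such criteria could be combined, here is a minimal sketch, not the authors' implementation. The `VideoStats` fields, the `keep_video` helper and the dynamic-score threshold are all hypothetical; only the 10-second minimum and the no-cut requirement come from the dataset description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VideoStats:
    """Per-video measurements. All field names are hypothetical."""
    duration_s: float     # video length in seconds
    num_scene_cuts: int   # cuts reported by some scene-cut detector
    dynamic_score: float  # e.g. a mean optical-flow magnitude
    semantic_ok: bool     # verdict of a semantic-level quality check


MIN_DURATION_S = 10.0     # stated in the text: at least 10 seconds
MIN_DYNAMIC_SCORE = 1.0   # hypothetical threshold, not from the source


def keep_video(v: VideoStats) -> bool:
    """Return True only if the video passes every filtering criterion."""
    return (
        v.duration_s >= MIN_DURATION_S
        and v.num_scene_cuts == 0              # long-take: no cuts
        and v.dynamic_score >= MIN_DYNAMIC_SCORE
        and v.semantic_ok
    )


def filter_videos(candidates: List[VideoStats]) -> List[VideoStats]:
    """Keep the candidates that satisfy all criteria, preserving order."""
    return [v for v in candidates if keep_video(v)]


candidates = [
    VideoStats(12.0, 0, 2.5, True),   # passes
    VideoStats(45.0, 2, 3.0, True),   # rejected: contains scene cuts
    VideoStats(8.0, 0, 2.0, True),    # rejected: too short
    VideoStats(20.0, 0, 0.2, True),   # rejected: nearly static
]
kept = filter_videos(candidates)
```

Treating the criteria as a conjunction means a video is dropped as soon as any single check fails, which is one simple way to read "multiple criteria"; the actual pipeline may order or weight the checks differently.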

Hierarchical video captioning process. First, we split the long video into 30-second clips and compose them into image grids. Then, we use the LLaVA-1.6-34B model to generate captions for each image grid. Finally, we use the Claude3-Haiku model to refine and merge these captions into the final complete caption for the whole video.
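The three steps above can be sketched as follows. This is an illustrative outline under stated assumptions: the 2x3 grid size, the evenly spaced frame sampling, and all function names are hypothetical, and the two models are represented by plain callables rather than any real API.

```python
from typing import Callable, List, Tuple

CLIP_SECONDS = 30.0  # clip length stated in the text


def split_into_clips(duration_s: float,
                     clip_s: float = CLIP_SECONDS) -> List[Tuple[float, float]]:
    """Split [0, duration_s) into consecutive (start, end) clips."""
    clips = []
    start = 0.0
    while start < duration_s:
        end = min(start + clip_s, duration_s)
        clips.append((start, end))
        start = end
    return clips


def grid_timestamps(start: float, end: float,
                    rows: int = 2, cols: int = 3) -> List[float]:
    """Pick rows*cols evenly spaced timestamps, one per grid cell.

    The 2x3 layout is an assumption made for this sketch.
    """
    n = rows * cols
    step = (end - start) / n
    return [start + (i + 0.5) * step for i in range(n)]


def caption_video(duration_s: float,
                  caption_grid: Callable[[List[float]], str],
                  merge: Callable[[List[str]], str]) -> str:
    """Caption each clip's image grid, then merge into one caption."""
    clip_captions = [
        caption_grid(grid_timestamps(start, end))
        for start, end in split_into_clips(duration_s)
    ]
    return merge(clip_captions)


# Stand-ins for the image-grid captioner and the caption-merging model.
def fake_captioner(timestamps: List[float]) -> str:
    return f"Clip sampled from {timestamps[0]:.1f}s to {timestamps[-1]:.1f}s."


def fake_merger(captions: List[str]) -> str:
    return " ".join(captions)


final_caption = caption_video(75.0, fake_captioner, fake_merger)
```

Passing the captioner and merger as callables keeps the clip-splitting and grid logic independent of whichever vision-language and language models are plugged in.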

LVD-2M Dataset Statistics

Comparing to Other Datasets

We compare datasets on five metrics: the long-take rate, as measured by human raters; caption length, the average word count per caption; dynamic degree, the average of human-rated dynamic scores on a 1 to 3 scale; the median video clip length; and the average optical flow magnitude.

LVD-2M for Finetuning Video Generation Models

Extending a Diffusion-based T2V Model for Longer Range on LVD-2M

Finetuning Dataset	Subject Consistency	Background Consistency	Temporal Flickering	Motion Smoothness	Dynamic Degree	Aesthetic Quality	Imaging Quality	Object Class
WebVid-10M	95.81%	98.02%	98.00%	97.87%	20.00%	58.02%	72.63%	76.95%
LVD-2M	96.12%	96.92%	97.44%	98.43%	28.06%	57.56%	70.72%	86.93%

Finetuning Dataset	Multiple Objects	Human Action	Color	Spatial Relationship	Scene	Appearance Style	Temporal Style	Overall Consistency
WebVid-10M	26.02%	61.40%	75.51%	51.06%	29.19%	20.12%	19.34%	21.43%
LVD-2M	22.76%	76.20%	79.32%	51.40%	32.95%	20.60%	20.25%	21.29%

VBench evaluation results for two T2V diffusion models finetuned separately on LVD-2M and WebVid-10M from the same base model. Finetuning on LVD-2M improves Dynamic Degree, Object Class and Human Action by more than 8 percentage points over finetuning on WebVid-10M.
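The per-metric differences can be recomputed directly from the two tables above. The short script below copies the reported scores and lists the metrics where LVD-2M leads WebVid-10M by more than 8 percentage points; the variable names are illustrative.

```python
# VBench scores (%) copied from the tables above.
metrics = [
    "Subject Consistency", "Background Consistency", "Temporal Flickering",
    "Motion Smoothness", "Dynamic Degree", "Aesthetic Quality",
    "Imaging Quality", "Object Class", "Multiple Objects", "Human Action",
    "Color", "Spatial Relationship", "Scene", "Appearance Style",
    "Temporal Style", "Overall Consistency",
]
webvid_10m = [95.81, 98.02, 98.00, 97.87, 20.00, 58.02, 72.63, 76.95,
              26.02, 61.40, 75.51, 51.06, 29.19, 20.12, 19.34, 21.43]
lvd_2m = [96.12, 96.92, 97.44, 98.43, 28.06, 57.56, 70.72, 86.93,
          22.76, 76.20, 79.32, 51.40, 32.95, 20.60, 20.25, 21.29]

# Gain of LVD-2M over WebVid-10M in percentage points, per metric.
gains = {
    name: round(lvd - web, 2)
    for name, web, lvd in zip(metrics, webvid_10m, lvd_2m)
}

# Metrics where the gain exceeds 8 percentage points.
big_gains = {name: g for name, g in gains.items() if g > 8.0}

for name, g in big_gains.items():
    print(f"{name}: +{g:.2f} points")
```

The same dictionary also shows where WebVid-10M scores higher, for example Multiple Objects and Imaging Quality, so the comparison is not uniformly in favor of one dataset.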

After finetuning a T2V diffusion model on LVD-2M, the generated videos are more dynamic, and the actions and objects in them are more plausible, than after finetuning on WebVid-10M.

Finetuning a Diffusion I2V Model and an LM-based T2V Model

Human evaluation of videos generated by baseline vs. finetuned models. We finetune both a diffusion-based I2V model and an LM-based T2V model on LVD-2M. Compared to the pretrained models, the finetuned models generate more dynamic videos.

Gallery of Video-Text Pairs

The video captures an intense basketball game or practice session on an indoor court with distinctive orange, black, purple, and white color schemes. The players, dressed in athletic attire, engage in various actions such as dribbling, shooting, passing, defending, and interacting with each other. The court, equipped with multiple hoops, serves as a dynamic backdrop for the ongoing game, highlighting the competitive and active nature of the sport.

The video depicts a snowboarder performing various maneuvers on a snow-covered slope. The snowboarder, dressed in a red jacket and black pants, is seen progressing from a crouched position to mid-air jumps and landings, all captured from a consistent side-angle perspective. The mountainous landscape with trees suggests the setting is a ski resort or mountainous area suitable for snowboarding, with clear skies and natural lighting.

A person wearing a scuba diving suit and equipment is shown holding a large, vibrant orange fish with a prominent dorsal fin, likely a type of grouper or snapper, in a clear blue underwater environment. The video appears to be focused on the diver's interaction with the marine life, possibly for educational or recreational purposes.

The video captures a cyclist performing a sequence of maneuvers at a skate park. The cyclist, wearing a helmet, is seen preparing to descend a ramp, executing the descent with speed and control, and landing safely at the bottom. The setting is consistent throughout, with the skate park visible in the background and the lighting suggesting a late afternoon or early evening timeframe. The video showcases the skill and excitement involved in cycling at a skate park.

The video depicts a baby in a white onesie exploring its surroundings. The baby is seated and then stands, gripping a white rail or handle. The baby's facial expressions suggest a curious and playful mood as it interacts with the object. The setting appears to be a child-friendly room with a bed and toys visible in the background. The camera maintains a consistent perspective, focusing on the baby's actions and expressions throughout the sequence.

The video depicts a first-person perspective of a person riding a mountain bike on a dirt trail. The camera focuses on the handlebars, capturing the rider's hands gripping them tightly as they navigate turns and changes in the terrain. The blurred background suggests a natural, outdoor setting, likely a mountainous or hilly area, characteristic of a mountain biking trail. The video conveys a sense of speed and action through the changing angles of the handlebars and the continuous motion.

The video depicts a person learning to snowboard at an indoor snowboarding facility. The learner is wearing a blue jacket and is being assisted by another individual, both wearing snowboarding gear including helmets and goggles. The video shows the learner's progress, starting with the person holding the instructor's hand for support, then leaning forward as they begin to fall, and eventually falling onto the snow-covered ground, with the instructor remaining nearby to offer guidance.

A video of a motorcycle ride on a winding road under a clear blue sky. The video is captured from the perspective of the motorcycle rider, with the handlebars and mirrors visible in each frame. The road curves gently to the left, and the background remains consistent throughout the sequence.

BibTeX


@article{xiong2024lvd2m,
      title={LVD-2M: A Long-take Video Dataset with Temporally Dense Captions}, 
      author={Tianwei Xiong and Yuqing Wang and Daquan Zhou and Zhijie Lin and Jiashi Feng and Xihui Liu},
      year={2024},
      journal={arXiv preprint arXiv:2410.10816}
}
