Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation

CVPR 2026
1. Tohoku University    2. LY Corporation

Abstract

We introduce SLVMEval, a benchmark for meta-evaluating text-to-video (T2V) evaluation systems. SLVMEval focuses on assessing these systems on long videos of up to 10,486 seconds (approximately 3 hours). Our benchmark targets a fundamental requirement: whether systems can accurately judge video quality in settings that are easy for humans to assess. We adopt a pairwise comparison-based meta-evaluation framework. Building on dense video captioning datasets, we synthetically degrade source videos to create controlled "high-quality vs. low-quality" pairs across 10 distinct aspects. We then use crowdsourcing to filter and retain only those pairs in which the degradation is clearly perceptible, thereby establishing the final testbed. Using this testbed, we assess the reliability of existing evaluation systems in ranking these pairs. Our experiments show that human evaluators identify the better long video with 84.7%–96.8% accuracy, while in 9 of the 10 aspects, the accuracy of these systems falls short of human judgment, revealing weaknesses in text-to-long video evaluation.

Overview

SLVMEval is designed to probe whether evaluation systems possess the minimum capability required for T2V model development. We target long videos — spanning several minutes to nearly three hours — which are precisely the regimes where current automatic systems struggle most. By constructing paired videos that differ in exactly one specified quality dimension, we enable precise, fine-grained meta-evaluation.

  • 3,932 Video Pairs — covering 1,461 unique prompts across 10 evaluation aspects.
  • Long-Video Focus — videos up to 10,486 seconds (~3 hours), far beyond existing benchmarks.
  • Human-Validated — crowdsourced annotation ensures pairs are clearly distinguishable by humans (84.7%–96.8% accuracy).

Dataset Comparison

SLVMEval significantly extends existing meta-evaluation benchmarks by targeting substantially longer videos and providing human-annotated pairwise judgments.

| Benchmark | Human Annot. | Aspects | Videos | Unique Prompts | Max Duration (sec) | Avg. Prompt Len. (chars) |
|---|---|---|---|---|---|---|
| UVE-Bench | | 15 | 1,045 | 293 | 6.1 | 73.68 |
| VBench | | 16 | 21,110 | 968 | 3.3 | 41.32 |
| VBench Long | | 16 | N/A | 944 | N/A | 41.00 |
| SLVMEval (ours) | ✓ | 10 | 3,932 | 1,461 | 10,486.0 | 57,883.52 |

Evaluation Aspects

SLVMEval covers 10 evaluation aspects organized into two categories. For each aspect, we construct paired videos by applying a controlled synthetic degradation to the original, keeping all other factors unchanged.

| # | Category | Aspect | Degradation |
|---|---|---|---|
| 1 | Video Quality | Aesthetics | Reduce the contrast of selected clips to assess frame-level aesthetic quality. |
| 2 | Video Quality | Technical Quality | Downscale the resolution to detect low-resolution artifacts. |
| 3 | Video Quality | Appearance Style | Apply mismatched artistic style transfer (e.g., oil painting, manga, sketch) to selected clips. |
| 4 | Video Quality | Background Consistency | Replace original backgrounds with random landscape images to break temporal visual stability. |
| 5 | Video Quality | Object Integrity | Erase prompt-specified objects via inpainting to remove key scene elements. |
| 6 | Video-Text Consistency | Color | Edit colors of prompt-specified objects to different colors, breaking color consistency with the description. |
| 7 | Video-Text Consistency | Dynamics Degree | Replace motion-containing clips with static middle frames to reduce dynamics described in the prompt. |
| 8 | Video-Text Consistency | Comprehensiveness | Remove several clips from the original video, reducing coverage of all prompt-described events. |
| 9 | Video-Text Consistency | Spatial Relationship | Horizontally flip clips mentioning left/right relations, inverting spatial consistency. |
| 10 | Video-Text Consistency | Temporal Flow | Relocate several consecutive clips to random positions, breaking the described event order. |
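Several of these degradations reduce to simple array-level transforms. The sketch below is illustrative only, not the paper's released pipeline; it assumes a clip is stored as a NumPy array of shape (T, H, W, C), and the function names are ours:

```python
import numpy as np

def reduce_contrast(clip: np.ndarray, factor: float = 0.5) -> np.ndarray:
    """Aesthetics-style degradation: blend each frame toward its own mean
    intensity, lowering contrast while keeping overall brightness."""
    mean = clip.astype(np.float32).mean(axis=(1, 2, 3), keepdims=True)
    return np.clip(mean + factor * (clip - mean), 0, 255).astype(clip.dtype)

def flip_horizontal(clip: np.ndarray) -> np.ndarray:
    """Spatial Relationship degradation: mirror the width axis,
    inverting left/right relations described in the prompt."""
    return clip[:, :, ::-1]

def freeze_motion(clip: np.ndarray) -> np.ndarray:
    """Dynamics Degree degradation: replace every frame of the clip
    with its middle frame, producing a static segment."""
    middle = clip[len(clip) // 2]
    return np.repeat(middle[np.newaxis], len(clip), axis=0)
```

In the benchmark's setting, such a transform is applied only to selected clips of the source video, so the pair differs in exactly one quality dimension.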

Video Examples

Examples of original and degraded video pairs from SLVMEval across different evaluation aspects. Each pair is constructed so that the degradation is clearly perceptible to human annotators.

Experimental Results

Accuracy (%) of each baseline system on SLVMEval, reported as accuracy ± 95% CI. Chance level is 50%.

Human evaluators achieve 84.7%–96.8% accuracy across all 10 aspects, while in 9 of the 10 aspects, the accuracy of automatic evaluation systems falls short of human judgment.
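Accuracy here is the fraction of pairs on which a judge prefers the original video over its degraded counterpart. A minimal sketch of this metric, using a normal-approximation (Wald) confidence interval as an assumption about the CI method, not necessarily the paper's exact procedure:

```python
import math

def pairwise_accuracy_ci(n_correct: int, n_pairs: int, z: float = 1.96):
    """Pairwise accuracy (%) with a normal-approximation 95% CI half-width.

    n_correct: pairs where the judge picked the undegraded video.
    n_pairs:   total evaluated pairs for the aspect.
    """
    p = n_correct / n_pairs
    half_width = z * math.sqrt(p * (1 - p) / n_pairs)
    return 100.0 * p, 100.0 * half_width
```

For example, a system that ranks 90 of 100 pairs correctly would be reported as roughly 90.0 ± 5.9 under this approximation.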

The first five aspects (Aesthetics through Object Integrity) fall under Video Quality; the remaining five (Color through Temporal Flow) fall under Video-Text Consistency.

| System | Aesthetics | Technical Quality | Appearance Style | Background Consistency | Object Integrity | Color | Dynamics Degree | Comprehensiveness | Spatial Relationship | Temporal Flow |
|---|---|---|---|---|---|---|---|---|---|---|
| *Video-based* | | | | | | | | | | |
| GPT-5 | 90.1±2.5 | 85.8±4.2 | 88.9±2.5 | 98.9±0.8 | 72.0±6.2 | 84.3±3.5 | 35.3±3.6 | 51.3±4.5 | 59.7±4.4 | 50.3±4.1 |
| GPT-5-mini | 84.0±3.0 | 48.1±6.1 | 78.0±3.2 | 95.2±1.6 | 66.5±6.5 | 69.4±4.5 | 31.5±3.5 | 45.7±4.5 | 51.1±4.5 | 43.7±4.1 |
| Qwen3 | 55.7±4.1 | 51.9±6.1 | 55.3±3.9 | 49.7±3.7 | 38.5±6.7 | 48.4±4.9 | 50.0±3.8 | 51.7±4.5 | 51.7±4.5 | 50.2±4.1 |
| *Text-based* | | | | | | | | | | |
| GPT-5 | 74.8±3.6 | 46.2±6.1 | 81.1±3.1 | 83.8±2.7 | 68.0±6.5 | 68.9±4.5 | 43.1±3.8 | 50.6±4.5 | 47.0±4.5 | 43.5±4.1 |
| GPT-5-mini | 75.0±3.6 | 53.8±6.1 | 79.6±3.2 | 81.1±2.9 | 65.5±6.6 | 71.8±4.4 | 43.8±3.8 | 50.6±4.5 | 51.1±4.5 | 41.2±4.0 |
| Qwen3 | 51.6±4.1 | 50.0±6.1 | 72.4±3.5 | 73.0±3.3 | 51.0±6.9 | 61.0±4.7 | 48.6±3.9 | 52.7±4.5 | 51.7±4.5 | 52.9±4.5 |
| CLIPScore | 56.4±5.8 | 72.3±7.7 | 53.2±5.5 | 68.6±4.8 | 76.0±8.4 | 66.2±6.5 | 51.7±5.4 | 57.4±6.3 | 55.1±6.3 | 50.5±5.8 |
| VideoScore | 52.5±5.8 | 33.8±8.3 | 65.7±5.3 | 71.2±4.7 | 66.0±9.3 | 33.8±6.5 | 52.7±4.9 | 34.5±6.1 | 49.6±6.4 | 46.3±5.8 |
| 🌟 Human | 96.5±2.1 | 91.8±4.7 | 95.2±2.4 | 95.0±2.3 | 86.6±6.7 | 96.8±2.4 | 95.9±2.1 | 84.7±4.6 | 88.2±4.1 | 86.6±4.0 |

Key Findings

  • 9/10 aspects: Automatic systems fall short. All current automatic evaluation systems — including GPT-5 — lag behind human performance on 9 out of 10 evaluation aspects, despite these tasks being easy for humans.
  • Duration sensitivity. For most evaluation aspects, automatic system accuracy decreases as video duration increases (negative Spearman's ρ), revealing critical limitations for long-video evaluation.
  • Strong human baseline. Human evaluators achieve 84.7%–96.8% accuracy across all 10 aspects, establishing a strong human-level target for future T2V evaluation systems to reach.
  • Reliable synthetic degradation. A high Pearson correlation (ρ > 0.94) between filtered and unfiltered results shows our degradation pipeline produces reliable pairs without costly manual filtering.
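The duration-sensitivity finding rests on Spearman's rank correlation between video duration and system accuracy. As a self-contained, standard-library sketch of the statistic itself (with average ranks for ties; the paper's exact aggregation over durations is not reproduced here):

```python
import math

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks,
    with tied values assigned the average rank of their group."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        out = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            # Extend j over the group of values tied with values[order[i]].
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
            for k in range(i, j + 1):
                out[order[k]] = avg_rank
            i = j + 1
        return out

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

A system whose accuracy falls monotonically as duration grows yields ρ = -1 under this statistic, which is the pattern the finding above describes.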

BibTeX

@inproceedings{matsuda2026slvmeval,
  title     = {SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation},
  author    = {Ryosuke Matsuda and Keito Kudo and Haruto Yoshida and Nobuyuki Shimizu and Jun Suzuki},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026},
}

Acknowledgement

This website is adapted from VDocRAG and Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.