We introduce SLVMEval, a benchmark for meta-evaluating text-to-video (T2V) evaluation systems. SLVMEval focuses on assessing these systems on long videos of up to 10,486 seconds (approximately three hours). Our benchmark targets a fundamental requirement: whether systems can accurately judge video quality in settings that are easy for humans to assess. We adopt a pairwise comparison-based meta-evaluation framework. Building on dense video captioning datasets, we synthetically degrade source videos to create controlled "high-quality vs. low-quality" pairs across 10 distinct aspects. We then use crowdsourcing to retain only those pairs in which the degradation is clearly perceptible, establishing the final testbed. Using this testbed, we assess how reliably existing evaluation systems rank these pairs. Our experiments show that human evaluators identify the better long video with 84.7%–96.8% accuracy, whereas automatic systems fall short of human judgment in 9 of the 10 aspects, revealing weaknesses in text-to-long-video evaluation.
SLVMEval is designed to probe whether evaluation systems possess the minimum capability required for T2V model development. We target long videos — spanning several minutes to nearly three hours — which are precisely the regimes where current automatic systems struggle most. By constructing paired videos that differ in exactly one specified quality dimension, we enable precise, fine-grained meta-evaluation.
SLVMEval significantly extends existing meta-evaluation benchmarks by targeting substantially longer videos and providing human-annotated pairwise judgments.
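To make the pairwise protocol concrete, here is a minimal sketch of how such an accuracy is computed. This is an illustration, not the authors' code: `judge` is a placeholder for any evaluation system that scores a (prompt, video) pair, and ties are counted as errors.

```python
# Minimal sketch of pairwise meta-evaluation (hypothetical API).
# A system is credited when it scores the original video above its
# degraded counterpart; chance level is 50%.
from typing import Callable, List, Tuple

def pairwise_accuracy(
    pairs: List[Tuple[str, str, str]],   # (prompt, original_path, degraded_path)
    judge: Callable[[str, str], float],  # judge(prompt, video_path) -> quality score
) -> float:
    correct = sum(
        judge(prompt, original) > judge(prompt, degraded)
        for prompt, original, degraded in pairs
    )
    return correct / len(pairs)
```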
| Benchmark | Human Annot. | Aspects | Videos | Unique Prompts | Max Duration (sec) | Avg. Prompt Len. (chars) |
|---|---|---|---|---|---|---|
| UVE-Bench | ✓ | 15 | 1,045 | 293 | 6.1 | 73.68 |
| VBench | ✓ | 16 | 21,110 | 968 | 3.3 | 41.32 |
| VBench Long | ✗ | 16 | N/A | 944 | N/A | 41.00 |
| SLVMEval (ours) | ✓ | 10 | 3,932 | 1,461 | 10,486.0 | 57,883.52 |
SLVMEval covers 10 evaluation aspects organized into two categories. For each aspect, we construct paired videos by applying a controlled synthetic degradation to the original, keeping all other factors unchanged (a minimal sketch of representative degradations follows the table).
| # | Category | Aspect | Degradation |
|---|---|---|---|
| 1 | Video Quality | Aesthetics | Reduce the contrast of selected clips to assess frame-level aesthetic quality. |
| 2 | Video Quality | Technical Quality | Downscale the resolution to detect low-resolution artifacts. |
| 3 | Video Quality | Appearance Style | Apply mismatched artistic style transfer (e.g., oil painting, manga, sketch) to selected clips. |
| 4 | Video Quality | Background Consistency | Replace original backgrounds with random landscape images to break temporal visual stability. |
| 5 | Video Quality | Object Integrity | Erase prompt-specified objects via inpainting to remove key scene elements. |
| 6 | Video-Text Consistency | Color | Edit colors of prompt-specified objects to different colors, breaking color consistency with the description. |
| 7 | Video-Text Consistency | Dynamics Degree | Replace motion-containing clips with static middle frames to reduce dynamics described in the prompt. |
| 8 | Video-Text Consistency | Comprehensiveness | Remove several clips from the original video, reducing coverage of all prompt-described events. |
| 9 | Video-Text Consistency | Spatial Relationship | Horizontally flip clips mentioning left/right relations, inverting spatial consistency. |
| 10 | Video-Text Consistency | Temporal Flow | Relocate several consecutive clips to random positions, breaking the described event order. |
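For illustration, several of these degradations map directly onto standard ffmpeg video filters. The following is a minimal sketch, not the authors' pipeline; filenames and parameter values (the contrast factor, target resolution) are placeholders.

```python
# Illustrative degradations via standard ffmpeg filters (placeholder paths/params).
import subprocess

def degrade(src: str, dst: str, vf: str) -> None:
    """Re-encode `src` with a single ffmpeg video filter applied."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", vf, "-c:a", "copy", dst],
        check=True,
    )

degrade("clip.mp4", "low_contrast.mp4", "eq=contrast=0.5")  # Aesthetics (#1)
degrade("clip.mp4", "low_res.mp4", "scale=320:-2")          # Technical Quality (#2)
degrade("clip.mp4", "flipped.mp4", "hflip")                 # Spatial Relationship (#9)
```

Degradations that operate at the clip level (e.g., removing or reordering clips for Comprehensiveness and Temporal Flow) would instead cut and re-concatenate segments, but the per-frame filters above convey the idea.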
Examples of original and degraded video pairs from SLVMEval across different evaluation aspects. Each pair is constructed so that the degradation is clearly perceptible to human annotators.
Accuracy (%) of each baseline system on SLVMEval, reported as accuracy ± 95% CI (chance level: 50%; a sketch of one common CI computation follows the table). The first five aspects form the Video Quality category and the last five the Video-Text Consistency category. Human evaluators achieve 84.7%–96.8% accuracy across all 10 aspects, while automatic evaluation systems fall short of human judgment in 9 of the 10 aspects.
| System | Aesthetics | Technical Quality | Appearance Style | Background Consistency | Object Integrity | Color | Dynamics Degree | Comprehensiveness | Spatial Relationship | Temporal Flow |
|---|---|---|---|---|---|---|---|---|---|---|
| **Video-based** | | | | | | | | | | |
| GPT-5 | 90.1±2.5 | 85.8±4.2 | 88.9±2.5 | 98.9±0.8 | 72.0±6.2 | 84.3±3.5 | 35.3±3.6 | 51.3±4.5 | 59.7±4.4 | 50.3±4.1 |
| GPT-5-mini | 84.0±3.0 | 48.1±6.1 | 78.0±3.2 | 95.2±1.6 | 66.5±6.5 | 69.4±4.5 | 31.5±3.5 | 45.7±4.5 | 51.1±4.5 | 43.7±4.1 |
| Qwen3 | 55.7±4.1 | 51.9±6.1 | 55.3±3.9 | 49.7±3.7 | 38.5±6.7 | 48.4±4.9 | 50.0±3.8 | 51.7±4.5 | 51.7±4.5 | 50.2±4.1 |
| **Text-based** | | | | | | | | | | |
| GPT-5 | 74.8±3.6 | 46.2±6.1 | 81.1±3.1 | 83.8±2.7 | 68.0±6.5 | 68.9±4.5 | 43.1±3.8 | 50.6±4.5 | 47.0±4.5 | 43.5±4.1 |
| GPT-5-mini | 75.0±3.6 | 53.8±6.1 | 79.6±3.2 | 81.1±2.9 | 65.5±6.6 | 71.8±4.4 | 43.8±3.8 | 50.6±4.5 | 51.1±4.5 | 41.2±4.0 |
| Qwen3 | 51.6±4.1 | 50.0±6.1 | 72.4±3.5 | 73.0±3.3 | 51.0±6.9 | 61.0±4.7 | 48.6±3.9 | 52.7±4.5 | 51.7±4.5 | 52.9±4.5 |
| CLIPScore | 56.4±5.8 | 72.3±7.7 | 53.2±5.5 | 68.6±4.8 | 76.0±8.4 | 66.2±6.5 | 51.7±5.4 | 57.4±6.3 | 55.1±6.3 | 50.5±5.8 |
| VideoScore | 52.5±5.8 | 33.8±8.3 | 65.7±5.3 | 71.2±4.7 | 66.0±9.3 | 33.8±6.5 | 52.7±4.9 | 34.5±6.1 | 49.6±6.4 | 46.3±5.8 |
| 🌟 Human | 96.5±2.1 | 91.8±4.7 | 95.2±2.4 | 95.0±2.3 | 86.6±6.7 | 96.8±2.4 | 95.9±2.1 | 84.7±4.6 | 88.2±4.1 | 86.6±4.0 |
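The `± 95% CI` entries can be reproduced with a standard binomial confidence interval. The sketch below uses the normal-approximation (Wald) interval with hypothetical trial counts; whether SLVMEval computes its intervals exactly this way is an assumption made here for illustration.

```python
# Wald 95% CI half-width for a pairwise accuracy (illustrative, not
# necessarily the paper's exact method). Counts below are hypothetical.
import math

def wald_ci_halfwidth(correct: int, total: int, z: float = 1.96) -> float:
    """Half-width of the 95% CI for accuracy = correct/total, in percent."""
    p = correct / total
    return 100.0 * z * math.sqrt(p * (1.0 - p) / total)

acc = 100.0 * 494 / 548  # hypothetical: 494 correct out of 548 pairs
print(f"{acc:.1f}±{wald_ci_halfwidth(494, 548):.1f}")  # -> 90.1±2.5
```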
@inproceedings{matsuda2026slvmeval,
title = {SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation},
author = {Ryosuke Matsuda and Keito Kudo and Haruto Yoshida and Nobuyuki Shimizu and Jun Suzuki},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2026},
}
This website is adapted from VDocRAG and Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.