Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation

CVPR 2026
1. Tohoku University    2. LY Corporation

Abstract

We introduce SLVMEval, a benchmark for meta-evaluating text-to-video (T2V) evaluation systems. SLVMEval focuses on assessing these systems on long videos of up to 10,486 seconds (approximately 3 hours). Our benchmark targets a fundamental requirement: whether systems can accurately judge video quality in settings that are easy for humans to assess. We adopt a pairwise comparison-based meta-evaluation framework. Building on dense video captioning datasets, we synthetically degrade source videos to create controlled "high-quality vs. low-quality" pairs across 10 distinct aspects. We then use crowdsourcing to filter and retain only those pairs in which the degradation is clearly perceptible, thereby establishing the final testbed. Using this testbed, we assess the reliability of existing evaluation systems in ranking these pairs. Our experiments show that human evaluators identify the better long video with 84.7%–96.8% accuracy, while in 9 of the 10 aspects, the accuracy of these systems falls short of human judgment, revealing weaknesses in text-to-long video evaluation.

Overview

SLVMEval is designed to probe whether evaluation systems possess the minimum capability required for T2V model development. We target long videos — spanning several minutes to nearly three hours — which are precisely the regimes where current automatic systems struggle most. By constructing paired videos that differ in exactly one specified quality dimension, we enable precise, fine-grained meta-evaluation.

  • 3,932 Video Pairs — covering 1,461 unique prompts across 10 evaluation aspects.
  • Long-Video Focus — videos up to 10,486 seconds (~3 hours), far beyond existing benchmarks.
  • Human-Validated — crowdsourced annotation ensures pairs are clearly distinguishable by humans (84.7%–96.8% accuracy).

Dataset Comparison

SLVMEval significantly extends existing meta-evaluation benchmarks by targeting substantially longer videos and providing human-annotated pairwise judgments.

Benchmark        Human Annot.  Aspects  Videos  Unique Prompts  Max Duration (sec)  Avg. Prompt Len. (chars)
UVE-Bench                      15       1,045   293             6.1                 73.68
VBench                         16       21,110  968             3.3                 41.32
VBench Long                    16       N/A     944             N/A                 41.00
SLVMEval (ours)                10       3,932   1,461           10,486.0            57,883.52

Evaluation Aspects

SLVMEval covers 10 evaluation aspects organized into two categories. For each aspect, we construct paired videos by applying a controlled synthetic degradation to the original, keeping all other factors unchanged.

#   Category                Aspect                  Degradation
1   Video Quality           Aesthetics              Reduce the contrast of selected clips to assess frame-level aesthetic quality.
2   Video Quality           Technical Quality       Downscale the resolution to detect low-resolution artifacts.
3   Video Quality           Appearance Style        Apply mismatched artistic style transfer (e.g., oil painting, manga, sketch) to selected clips.
4   Video Quality           Background Consistency  Replace original backgrounds with random landscape images to break temporal visual stability.
5   Video Quality           Object Integrity        Erase prompt-specified objects via inpainting to remove key scene elements.
6   Video-Text Consistency  Color                   Edit colors of prompt-specified objects to different colors, breaking color consistency with the description.
7   Video-Text Consistency  Dynamics Degree         Replace motion-containing clips with static middle frames to reduce dynamics described in the prompt.
8   Video-Text Consistency  Comprehensiveness       Remove several clips from the original video, reducing coverage of all prompt-described events.
9   Video-Text Consistency  Spatial Relationship    Horizontally flip clips mentioning left/right relations, inverting spatial consistency.
10  Video-Text Consistency  Temporal Flow           Relocate several consecutive clips to random positions, breaking the described event order.

Video Examples

Examples of original and degraded video pairs from SLVMEval across different evaluation aspects. Each pair is constructed so that the degradation is clearly perceptible to human annotators.

Video Quality
Aesthetics
Original
Degraded
Degradation Operation: Reduce the contrast of selected clips using FFmpeg’s eq filter with contrast set to −0.8.
Video Description
The video shows a close-up of a half sandwich and a side of fries with toppings on a wooden serving board. The sandwich appears to be filled with greens and possibly a protein, held in someone's hand. The environment seems casual, potentially a street food setting or a casual dining restaurant. The light is warm and diffused, casting soft shadows and highlighting the textures of the food. The colors are primarily the golden brown of the fries, the green of the sandwich filling, and the red and green of the fry seasonings.
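The contrast reduction used for the Aesthetics example can be sketched as a small command builder. This is a minimal sketch, not the authors' pipeline: the helper name `build_contrast_command`, the file names, and the decision to copy the audio stream are assumptions; only the `eq` filter and the contrast value of -0.8 come from the description above. It also assumes the selected clip has already been cut into its own file.

```python
import shlex


def build_contrast_command(src, dst, contrast=-0.8):
    """Return an ffmpeg argument list that applies the eq filter to one clip.

    `src` and `dst` are hypothetical file paths; the clip selection step
    is outside the scope of this sketch.
    """
    video_filter = f"eq=contrast={contrast}"
    # -y overwrites dst if it exists; -c:a copy keeps the audio untouched.
    return ["ffmpeg", "-y", "-i", src, "-vf", video_filter, "-c:a", "copy", dst]


cmd = build_contrast_command("original_clip.mp4", "degraded_clip.mp4")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` on a machine where FFmpeg is installed.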
Technical Quality
Original
Degraded
Degradation Operation: Downscale the resolution of selected clips, then upscale back — reducing sharpness and fine detail.
Video Description
In a setting that seems like an indoor studio or room with a cozy ambiance, the video features a brown tabby cat named Tweeter. This domestic long-hair cat, indicated to be an 11-year-old spayed female, is wearing a bright yellow tag. The feline appears healthy and well-cared for, with a fluffy fur coat. She sits calmly, being petted by someone off-camera, occasionally turning her head to look around. Her inquisitive nature is highlighted by the voice-over that mentions her keen interest in observing the activities of other animals.
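The downscale-then-upscale operation used for the Technical Quality example can be illustrated on a toy grayscale frame. This is a sketch under stated assumptions: the scale factor, the block-averaging downscale, and the nearest-neighbour upscale are illustrative choices, since the page does not state which resampling method or factor is used.

```python
def downscale_then_upscale(frame, factor=2):
    """Block-average a grayscale frame by `factor`, then repeat pixels back to size.

    `frame` is a list of rows of numbers. The factor and both resampling
    methods are assumptions made for illustration.
    """
    height, width = len(frame), len(frame[0])
    if height % factor or width % factor:
        raise ValueError("frame dimensions must be divisible by factor")
    # Downscale: each output pixel is the mean of a factor x factor block.
    small = [
        [
            sum(frame[y + dy][x + dx] for dy in range(factor) for dx in range(factor))
            / (factor * factor)
            for x in range(0, width, factor)
        ]
        for y in range(0, height, factor)
    ]
    # Upscale: repeating pixels restores the size but not the lost detail.
    return [
        [small[y // factor][x // factor] for x in range(width)]
        for y in range(height)
    ]


# A checkerboard holds the finest detail a 4x4 frame can carry.
frame = [
    [0, 100, 0, 100],
    [100, 0, 100, 0],
    [0, 100, 0, 100],
    [100, 0, 100, 0],
]
blurred = downscale_then_upscale(frame, factor=2)
```

In this toy case every 2x2 block averages to 50.0, so the checkerboard pattern disappears entirely while the frame size stays 4x4, which is the loss of sharpness the degradation is meant to produce.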
Appearance Style
Original
Degraded
Degradation Operation: Apply a mismatched artistic style transfer (cartoon, oil painting, manga, watercolor, or sketch) to selected clips using OpenCV filters.
Video Description
The video frames show a street food vendor's display case, likely for selling traditional Vietnamese snacks. Inside the glass display are neatly arranged, segmented compartments filled with various types of sliced foods. We can see crisp-looking green slices, possibly some kind of pickled vegetables, next to off-white and beige slices that could be some form of processed or raw tubers. In the top left corner, there's a basket holding deep-fried items with a golden-brown hue, which the vendor is seen picking up with their gloved hand.
Background Consistency
Original
Degraded
Degradation Operation: Remove original backgrounds using rembg, then replace them with randomly sampled landscape images from the nature-dataset.
Video Description
The clip opens with a close-up of hands holding a small item, possibly food, with a focus on the textures and colors of the item and skin. The background is indistinct, highlighting the subject's action. The scene cuts to a wider shot featuring a group of people outdoors under a clear sky, with one holding an object high above, which is indicated as a 'potato'. The environment is lively, and the lighting suggests daytime. The colors are vibrant with diverse clothing styles hinting at cultural diversity.
Object Integrity
Original
Degraded
Degradation Operation: Extract object names from prompts, localize them with Grounding DINO, then erase them via Stable-Diffusion-Inpainting.
Video Description
The clip features a medium close-up of a person seated inside a vehicle, likely to emphasize the subject and the vehicle's interior features. The person is wearing a grey polo shirt with a red logo on the left side, gesturing with their hands to emphasize the wide range of vehicle accessories available. Their actions suggest they are explaining how customizable these vehicles are. The vehicle interior is black, with visible branding 'VHT-X' on the seat and a clear view of the vehicle's steering column and door frame.
Video-Text Consistency
Color
Original
Degraded
Degradation Operation: Identify clips mentioning object colors from the prompt, then modify the colors of those objects in all frames using Qwen-Image-Edit.
Video Description
The clip opens with a full-screen blue background featuring the ESA (European Space Agency) logo in white. The ESA logo consists of a circular design with an array of lines and dots forming what appears to be a stylized 'E' within a globe representation. Then additional elements appear: the logos of NASA, JAXA (Japan Aerospace Exploration Agency), Roscosmos (Russian space agency), and CSA (Canadian Space Agency) line up alongside ESA's logo. The color palette is mainly blue and white, with some touches of national colors in the various space agencies' logos.
Dynamics Degree
Original
Degraded
Degradation Operation: Identify motion-related clips from the prompt, then replace each frame in those clips with the middle frame — effectively producing a static clip.
Video Description
The video showcases a hermit crab with a grey, patterned shell moving across a sandy surface, likely at Dry Tortugas National Park. The crab's shell appears large relative to its body, suggesting it's an adult. As the crab moves, its legs and claws are visible, displaying reddish-brown colors that contrast with its pale shell. The focus remains solely on the crab's movement and behavior as it navigates the terrain. The scene is devoid of human activity, highlighting the natural habitat of the creature.
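The Dynamics Degree operation, replacing every frame of a motion-related clip with that clip's middle frame, can be sketched on plain Python lists. The function names and the representation of a clip as a list of frames are assumptions for illustration; how motion-related clips are identified from the prompt is not covered here.

```python
def freeze_to_middle_frame(frames):
    """Return a clip of the same length in which every frame is the middle one."""
    if not frames:
        return []
    middle = frames[len(frames) // 2]
    return [middle] * len(frames)


def degrade_dynamics(clips, motion_clip_ids):
    """Freeze only the clips whose index is in `motion_clip_ids`."""
    return [
        freeze_to_middle_frame(frames) if index in motion_clip_ids else list(frames)
        for index, frames in enumerate(clips)
    ]


clips = [["a0", "a1", "a2"], ["b0", "b1", "b2", "b3"]]
degraded = degrade_dynamics(clips, motion_clip_ids={1})
```

Keeping the frame count unchanged means the degraded video has the same duration as the original, so only the amount of motion differs between the two.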
Comprehensiveness
Original (3 segments)
Degraded (2 segments, one segment missing)
Phase 1  Original: Seg 1 / 3 (5 s)   Degraded: Seg 1 / 2 (5 s)
Phase 2  Original: Seg 2 / 3 (5 s)   Degraded: segment removed
Phase 3  Original: Seg 3 / 3 (10 s)  Degraded: Seg 2 / 2 (10 s)
Degradation Operation: Randomly remove 5 clips from the original video, reducing coverage of all prompt-described events.
Video Description
The individual, wearing a long-sleeve black shirt and jeans, stands on a dimly lit stage with a casual posture, holding what appears to be a remote or clicker. There are two birch tree stumps of different heights positioned to the speaker's left. To the right, there is a white bench with vertical slats, suggesting a simple, nature-inspired set design. The stage floor is dark, contrasting with the lighter backdrop. The colors are muted with blacks, whites, and greys dominating the scene. The degraded version removes the middle segment, leaving only 2 of the original 3 segments (Seg 2 is missing from the degraded video).
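The Comprehensiveness operation, randomly removing clips while leaving the remaining ones in their original order, can be sketched as follows. The function name, the seed parameter, and the clip labels are hypothetical; the default of 5 removed clips follows the degradation description above.

```python
import random


def remove_random_clips(clips, num_to_remove=5, seed=None):
    """Drop `num_to_remove` randomly chosen clips, keeping the rest in order."""
    if not 0 <= num_to_remove < len(clips):
        raise ValueError("num_to_remove must leave at least one clip")
    rng = random.Random(seed)
    removed = set(rng.sample(range(len(clips)), num_to_remove))
    kept = [clip for index, clip in enumerate(clips) if index not in removed]
    return kept, sorted(removed)


clips = [f"clip_{i:02d}" for i in range(12)]
kept, removed = remove_random_clips(clips, num_to_remove=5, seed=0)
```

Returning the removed indices alongside the shortened video makes it possible to record which prompt-described events are no longer covered.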
Spatial Relationship
Original
Degraded
Degradation Operation: Identify clips mentioning left/right spatial relations, then horizontally flip all frames in those clips.
Video Description
The video shows a detailed view of a toy, specifically a Kamen Rider Gaim's Sengoku Driver with an attached Genesis Core on the left side. The driver is predominantly black with metallic and gold accents, displaying an LED screen reading '5.13' and a handle on the right. In the background, there is a nondescript beige wall. The light source appears to be coming from the front, illuminating the toy evenly without harsh shadows. In the last frame, a hand is inserting a pink and blue lockseed into the right side of the driver.
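The Spatial Relationship operation can be sketched in two steps: select clips whose caption mentions left or right, then mirror every frame of those clips. The keyword regex is an assumption, since the page does not say how such clips are identified, and it would also match non-spatial uses such as "right now". Frames are represented as lists of pixel rows for illustration.

```python
import re

# Assumed selection rule: a whole-word match on "left" or "right".
SPATIAL_PATTERN = re.compile(r"\b(left|right)\b", re.IGNORECASE)


def mentions_left_right(caption):
    return bool(SPATIAL_PATTERN.search(caption))


def flip_horizontally(frame):
    """Mirror one frame by reversing each row of pixels."""
    return [row[::-1] for row in frame]


def degrade_spatial(clips):
    """`clips` is a list of (caption, frames) pairs; flip only the matching clips."""
    return [
        (caption, [flip_horizontally(f) for f in frames])
        if mentions_left_right(caption)
        else (caption, frames)
        for caption, frames in clips
    ]


frame = [[1, 2, 3], [4, 5, 6]]
clips = [
    ("A handle on the right of the driver", [frame]),
    ("A beige wall in the background", [frame]),
]
degraded = degrade_spatial(clips)
```

After the flip, an object described as being on the right appears on the left, so the video contradicts its caption while remaining visually intact.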
Temporal Flow
Original
Degraded
Degradation Operation: Move 5 consecutive clips to random positions, thereby breaking the temporal order of events.
Video Description
A close-up shot focuses on the torso of an individual wearing a black utility vest with multiple pockets and attachments. The person appears to be holding an electronic device, possibly a tablet, which has a logo resembling a flame, and is housed in a black protective case with clips. A pen is also visible, secured on the vest. The vest has a velcro patch area but no discernible patches attached. The individual's other hand is partially visible, holding the tablet or adjusting its position. The vest and tablet are predominantly black, providing a stark contrast to the white background.
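The Temporal Flow operation can be sketched as cutting out a run of consecutive clips and reinserting it elsewhere. One reading is assumed here: the 5 clips move together as a single block to one new random position. The function name and seed parameter are hypothetical.

```python
import random


def relocate_block(clips, block_len=5, seed=None):
    """Move `block_len` consecutive clips, as one block, to a different position."""
    n = len(clips)
    if not 0 < block_len < n:
        raise ValueError("block_len must be positive and shorter than the video")
    rng = random.Random(seed)
    start = rng.randrange(n - block_len + 1)
    block = clips[start:start + block_len]
    rest = clips[:start] + clips[start + block_len:]
    # Reinserting at `start` would reproduce the original order, so exclude it.
    candidates = [pos for pos in range(len(rest) + 1) if pos != start]
    pos = rng.choice(candidates)
    return rest[:pos] + block + rest[pos:]


original = list(range(12))
shuffled = relocate_block(original, block_len=5, seed=0)
```

Excluding the original position guarantees that the degraded video differs from the original whenever the clips are distinct, while no clip is added or lost.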

Experimental Results

Accuracy (%) of each baseline system on SLVMEval, reported with 95% confidence intervals. Chance level is 50%.

Human evaluators achieve 84.7%–96.8% accuracy across all 10 aspects, while in 9 of the 10 aspects, the accuracy of automatic evaluation systems falls short of human judgment.

System       Video Quality                                              Video-Text Consistency
             Aesthetics  Technical  Appearance  Background   Object     Color     Dynamics  Comprehensiveness  Spatial       Temporal
                         Quality    Style       Consistency  Integrity            Degree                       Relationship  Flow
Video-based
GPT-5        90.1±2.5    85.8±4.2   88.9±2.5    98.9±0.8     72.0±6.2   84.3±3.5  35.3±3.6  51.3±4.5           59.7±4.4      50.3±4.1
GPT-5-mini   84.0±3.0    48.1±6.1   78.0±3.2    95.2±1.6     66.5±6.5   69.4±4.5  31.5±3.5  45.7±4.5           51.1±4.5      43.7±4.1
Qwen3        55.7±4.1    51.9±6.1   55.3±3.9    49.7±3.7     38.5±6.7   48.4±4.9  50.0±3.8  51.7±4.5           51.7±4.5      50.2±4.1
Text-based
GPT-5        74.8±3.6    46.2±6.1   81.1±3.1    83.8±2.7     68.0±6.5   68.9±4.5  43.1±3.8  50.6±4.5           47.0±4.5      43.5±4.1
GPT-5-mini   75.0±3.6    53.8±6.1   79.6±3.2    81.1±2.9     65.5±6.6   71.8±4.4  43.8±3.8  50.6±4.5           51.1±4.5      41.2±4.0
Qwen3        51.6±4.1    50.0±6.1   72.4±3.5    73.0±3.3     51.0±6.9   61.0±4.7  48.6±3.9  52.7±4.5           51.7±4.5      52.9±4.5
CLIPScore    56.4±5.8    72.3±7.7   53.2±5.5    68.6±4.8     76.0±8.4   66.2±6.5  51.7±5.4  57.4±6.3           55.1±6.3      50.5±5.8
VideoScore   52.5±5.8    33.8±8.3   65.7±5.3    71.2±4.7     66.0±9.3   33.8±6.5  52.7±4.9  34.5±6.1           49.6±6.4      46.3±5.8
Human        96.5±2.1    91.8±4.7   95.2±2.4    95.0±2.3     86.6±6.7   96.8±2.4  95.9±2.1  84.7±4.6           88.2±4.1      86.6±4.0
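Each table entry is a pairwise accuracy: the share of pairs in which a system ranks the original video above its degraded counterpart. The sketch below computes that accuracy with a 95% interval. The normal-approximation (Wald) interval is an assumption, since this page does not state how the reported intervals were computed, and the input judgments are made-up toy values.

```python
import math


def pairwise_accuracy(preferred_original, z=1.96):
    """Return (accuracy %, 95% CI half-width %) for a list of pairwise judgments.

    Each entry is True when the system ranked the original video above the
    degraded one. The Wald interval used here is an assumed choice.
    """
    n = len(preferred_original)
    if n == 0:
        raise ValueError("at least one judgment is required")
    accuracy = sum(preferred_original) / n
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return 100 * accuracy, 100 * half_width


# Toy input: 90 correct and 10 incorrect judgments.
judgments = [True] * 90 + [False] * 10
accuracy, half_width = pairwise_accuracy(judgments)
print(f"{accuracy:.1f}±{half_width:.1f}")
```

Because every pair has exactly one correct answer, a system that guesses at random is expected to score 50%, which is the chance level quoted for the table.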

Key Findings

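  • 9/10 aspects: Automatic systems fall short. All evaluated automatic systems, including GPT-5, fall below human accuracy on 9 of the 10 aspects, even though these pairs are easy for humans to judge. The exception is Background Consistency, where video-based GPT-5 (98.9%) and GPT-5-mini (95.2%) exceed the human score (95.0%).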
  • Duration sensitivity. For most evaluation aspects, automatic system accuracy decreases as video duration increases (negative Spearman ρS), revealing critical limitations for long-video evaluation.
  • Strong human baseline. Human evaluators achieve 84.7%–96.8% accuracy across all 10 aspects, establishing a strong human-level target for future T2V evaluation systems to reach.
  • Reliable synthetic degradation. High Pearson correlation (ρP > 0.94) between filtered and unfiltered results shows our degradation pipeline produces reliable pairs without costly manual filtering.
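The two correlation statistics named in the findings can be computed with the standard library alone. This is a minimal sketch with made-up toy numbers, not the paper's data: Spearman's coefficient is obtained as the Pearson correlation of average ranks, and the variable names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy values only: accuracy falling steadily as duration (seconds) grows.
durations = [60, 600, 1800, 3600, 7200]
accuracies = [0.90, 0.80, 0.70, 0.65, 0.50]
rho_s = spearman(durations, accuracies)

# Toy values only: accuracies on filtered vs. unfiltered pairs.
filtered = [90.0, 80.0, 70.0]
unfiltered = [88.0, 79.0, 71.0]
rho_p = pearson(filtered, unfiltered)
```

A negative `rho_s` corresponds to the duration-sensitivity finding, and a `rho_p` close to 1 corresponds to filtered and unfiltered results moving together.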

BibTeX

@inproceedings{matsuda2026slvmeval,
  title     = {SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation},
  author    = {Ryosuke Matsuda and Keito Kudo and Haruto Yoshida and Nobuyuki Shimizu and Jun Suzuki},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026},
}

Acknowledgement

This website is adapted from VDocRAG and Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.