SLVMEval

Abstract

We introduce SLVMEval, a benchmark for meta-evaluating text-to-video (T2V) evaluation systems. SLVMEval focuses on assessing these systems on long videos of up to 10,486 seconds (approximately 3 hours). Our benchmark targets a fundamental requirement: whether systems can accurately judge video quality in settings that are easy for humans to assess. We adopt a pairwise comparison-based meta-evaluation framework. Building on dense video captioning datasets, we synthetically degrade source videos to create controlled "high-quality vs. low-quality" pairs across 10 distinct aspects. We then use crowdsourcing to filter and retain only those pairs in which the degradation is clearly perceptible, thereby establishing the final testbed. Using this testbed, we assess the reliability of existing evaluation systems in ranking these pairs. Our experiments show that human evaluators identify the better long video with 84.7%–96.8% accuracy, while in 9 of the 10 aspects, the accuracy of these systems falls short of human judgment, revealing weaknesses in text-to-long video evaluation.

Overview

SLVMEval is designed to probe whether evaluation systems possess the minimum capability required for T2V model development. We target long videos — spanning several minutes to nearly three hours — which are precisely the regimes where current automatic systems struggle most. By constructing paired videos that differ in exactly one specified quality dimension, we enable precise, fine-grained meta-evaluation.

3,932 Video Pairs — covering 1,461 unique prompts across 10 evaluation aspects.
Long-Video Focus — videos up to 10,486 seconds (~3 hours), far beyond existing benchmarks.
Human-Validated — crowdsourced annotation ensures pairs are clearly distinguishable by humans (84.7%–96.8% accuracy).
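The pairwise protocol described above can be sketched as follows. This is a minimal illustration, not the benchmark's released code: the `Pair` record, the `score_fn(prompt, video)` callable, and the choice to count ties as errors are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Pair:
    """One benchmark item (hypothetical schema): a prompt and two videos."""
    prompt: str
    original: str   # path or ID of the source video
    degraded: str   # path or ID of the synthetically degraded video
    aspect: str     # e.g. "Color" or "Temporal Flow"


def pairwise_accuracy(
    pairs: Sequence[Pair],
    score_fn: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the evaluator scores the original strictly higher.

    Ties count as errors here; that is an assumption of this sketch.
    """
    if not pairs:
        raise ValueError("no pairs to evaluate")
    correct = sum(
        score_fn(p.prompt, p.original) > score_fn(p.prompt, p.degraded)
        for p in pairs
    )
    return correct / len(pairs)
```

An evaluator that cannot tell the two videos apart lands near the 50% chance level under this scoring, which is what makes the pairwise setup easy to interpret.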

Dataset Comparison

SLVMEval significantly extends existing meta-evaluation benchmarks by targeting substantially longer videos and providing human-annotated pairwise judgments.

Benchmark	Human Annot.	Aspects	Videos	Unique Prompts	Max Duration (sec)	Avg. Prompt Len. (chars)
UVE-Bench	✓	15	1,045	293	6.1	73.68
VBench	✓	16	21,110	968	3.3	41.32
VBench Long	✗	16	N/A	944	N/A	41.00
SLVMEval (ours)	✓	10	3,932	1,461	10,486.0	57,883.52

Evaluation Aspects

SLVMEval covers 10 evaluation aspects organized into two categories. For each aspect, we construct paired videos by applying a controlled synthetic degradation to the original, keeping all other factors unchanged.

#	Category	Aspect	Degradation
1	Video Quality	Aesthetics	Reduce the contrast of selected clips to assess frame-level aesthetic quality.
2	Video Quality	Technical Quality	Downscale the resolution to detect low-resolution artifacts.
3	Video Quality	Appearance Style	Apply mismatched artistic style transfer (e.g., oil painting, manga, sketch) to selected clips.
4	Video Quality	Background Consistency	Replace original backgrounds with random landscape images to break temporal visual stability.
5	Video Quality	Object Integrity	Erase prompt-specified objects via inpainting to remove key scene elements.
6	Video-Text Consistency	Color	Edit colors of prompt-specified objects to different colors, breaking color consistency with the description.
7	Video-Text Consistency	Dynamics Degree	Replace motion-containing clips with static middle frames to reduce dynamics described in the prompt.
8	Video-Text Consistency	Comprehensiveness	Remove several clips from the original video, reducing coverage of all prompt-described events.
9	Video-Text Consistency	Spatial Relationship	Horizontally flip clips mentioning left/right relations, inverting spatial consistency.
10	Video-Text Consistency	Temporal Flow	Relocate several consecutive clips to random positions, breaking the described event order.

Video Examples

Examples of original and degraded video pairs from SLVMEval across different evaluation aspects. Each pair is constructed so that the degradation is clearly perceptible to human annotators.

Video Quality

Aesthetics

Original

Degraded

Degradation Operation: Reduce the contrast of selected clips using FFmpeg’s eq filter with contrast set to −0.8.
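A minimal sketch of how this operation might be invoked from Python. The `eq` filter and its `contrast` option are standard FFmpeg; the helper names, the file paths, and the decision to copy the audio stream unchanged are assumptions for illustration. How clips are selected is not specified here, so the sketch operates on a single clip file.

```python
import subprocess
from typing import List


def contrast_cmd(src: str, dst: str, contrast: float = -0.8) -> List[str]:
    """Build an FFmpeg command that applies the eq filter with the given contrast."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vf", f"eq=contrast={contrast}",
        "-c:a", "copy",  # keep the audio stream as-is (assumption of this sketch)
        dst,
    ]


def reduce_contrast(src: str, dst: str, contrast: float = -0.8) -> None:
    """Run the command; requires an ffmpeg binary on PATH."""
    subprocess.run(contrast_cmd(src, dst, contrast), check=True)
```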

Video Description

The video shows a close-up of a half sandwich and a side of fries with toppings on a wooden serving board. The sandwich appears to be filled with greens and possibly a protein, held in someone's hand. The environment seems casual, potentially a street food setting or a casual dining restaurant. The light is warm and diffused, casting soft shadows and highlighting the textures of the food. The colors are primarily the golden brown of the fries, the green of the sandwich filling, and the red and green of the fry seasonings.

Technical Quality

Original

Degraded

Degradation Operation: Downscale the resolution of selected clips, then upscale back — reducing sharpness and fine detail.
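One way to express a downscale-then-upscale round trip is as a chain of two FFmpeg `scale` filters. The sketch below only builds the filter string; the scaling factor of 4 and the function name are assumptions, since the page does not state the factor used.

```python
def down_up_filter(width: int, height: int, factor: int = 4) -> str:
    """Return an FFmpeg -vf string that shrinks a frame by `factor`, then restores its size."""
    if factor < 2:
        raise ValueError("factor must be at least 2 to lose detail")
    small_w = max(1, width // factor)
    small_h = max(1, height // factor)
    # First scale discards detail; second scale returns to the original dimensions.
    return f"scale={small_w}:{small_h},scale={width}:{height}"
```

The output keeps the original resolution, so the degraded clip differs from the source only in sharpness and fine detail.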

Video Description

In a setting that seems like an indoor studio or room with a cozy ambiance, the video features a brown tabby cat named Tweeter. This domestic long-hair cat, indicated to be an 11-year-old spayed female, is wearing a bright yellow tag. The feline appears healthy and well-cared for, with a fluffy fur coat. She sits calmly, being petted by someone off-camera, occasionally turning her head to look around. Her inquisitive nature is highlighted by the voice-over that mentions her keen interest in observing the activities of other animals.

Appearance Style

Original

Degraded

Degradation Operation: Apply a mismatched artistic style transfer (cartoon, oil painting, manga, watercolor, or sketch) to selected clips using OpenCV filters.

Video Description

The video frames show a street food vendor's display case, likely for selling traditional Vietnamese snacks. Inside the glass display are neatly arranged, segmented compartments filled with various types of sliced foods. We can see crisp-looking green slices, possibly some kind of pickled vegetables, next to off-white and beige slices that could be some form of processed or raw tubers. In the top left corner, there's a basket holding deep-fried items with a golden-brown hue, which the vendor is seen picking up with their gloved hand.

Background Consistency

Original

Degraded

Degradation Operation: Remove original backgrounds using rembg, then replace them with randomly sampled landscape images from the nature-dataset.

Video Description

The clip opens with a close-up of hands holding a small item, possibly food, with a focus on the textures and colors of the item and skin. The background is indistinct, highlighting the subject's action. The scene cuts to a wider shot featuring a group of people outdoors under a clear sky, with one holding an object high above, which is indicated as a 'potato'. The environment is lively, and the lighting suggests daytime. The colors are vibrant with diverse clothing styles hinting at cultural diversity.

Object Integrity

Original

Degraded

Degradation Operation: Extract object names from prompts, localize them with Grounding DINO, then erase them via Stable-Diffusion-Inpainting.

Video Description

The clip features a medium close-up of a person seated inside a vehicle, likely to emphasize the subject and the vehicle's interior features. The person is wearing a grey polo shirt with a red logo on the left side, gesturing with their hands to emphasize the wide range of vehicle accessories available. Their actions suggest they are explaining how customizable these vehicles are. The vehicle interior is black, with visible branding 'VHT-X' on the seat and a clear view of the vehicle's steering column and door frame.

Video-Text Consistency

Color

Original

Degraded

Degradation Operation: Identify clips mentioning object colors from the prompt, then modify the colors of those objects in all frames using Qwen-Image-Edit.

Video Description

The clip opens with a full-screen blue background featuring the ESA (European Space Agency) logo in white. The ESA logo consists of a circular design with an array of lines and dots forming what appears to be a stylized 'E' within a globe representation. Then additional elements appear: the logos of NASA, JAXA (Japan Aerospace Exploration Agency), Roscosmos (Russian space agency), and CSA (Canadian Space Agency) line up alongside ESA's logo. The color palette is mainly blue and white, with some touches of national colors in the various space agencies' logos.

Dynamics Degree

Original

Degraded

Degradation Operation: Identify motion-related clips from the prompt, then replace each frame in those clips with the middle frame — effectively producing a static clip.
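The frame replacement step can be sketched on an in-memory list of frames. The function name is hypothetical, and for an even number of frames the sketch picks the later of the two central frames, which is an arbitrary choice.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def freeze_to_middle(frames: Sequence[Frame]) -> List[Frame]:
    """Replace every frame of a clip with its middle frame, keeping the clip length."""
    if not frames:
        return []
    middle = frames[len(frames) // 2]
    return [middle] * len(frames)
```

Because the frame count is unchanged, the degraded clip has the same duration as the original and differs only in motion.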

Video Description

The video showcases a hermit crab with a grey, patterned shell moving across a sandy surface, likely at Dry Tortugas National Park. The crab's shell appears large relative to its body, suggesting it's an adult. As the crab moves, its legs and claws are visible, displaying reddish-brown colors that contrast with its pale shell. The focus remains solely on the crab's movement and behavior as it navigates the terrain. The scene is devoid of human activity, highlighting the natural habitat of the creature.

Comprehensiveness

Original (3 segments)

Degraded (2 segments)

Phase 1: Original Seg 1/3 (5 s); Degraded Seg 1/2 (5 s)

Phase 2: Original Seg 2/3 (5 s); Degraded: segment removed

Phase 3: Original Seg 3/3 (10 s); Degraded Seg 2/2 (10 s)

Degradation Operation: Randomly remove 5 clips from the original video, reducing coverage of all prompt-described events.
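A minimal sketch of random clip removal on a list of clips. The default of 5 follows the operation above; the function name, the seeding, and the uniform sampling without replacement are assumptions.

```python
import random
from typing import List, Optional, Sequence, TypeVar

Clip = TypeVar("Clip")


def drop_random_clips(
    clips: Sequence[Clip],
    n_remove: int = 5,
    seed: Optional[int] = None,
) -> List[Clip]:
    """Remove `n_remove` randomly chosen clips while preserving the order of the rest."""
    if not 0 < n_remove < len(clips):
        raise ValueError("n_remove must be positive and smaller than the clip count")
    rng = random.Random(seed)
    removed = set(rng.sample(range(len(clips)), n_remove))
    return [clip for i, clip in enumerate(clips) if i not in removed]
```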

Video Description

The individual, wearing a long-sleeve black shirt and jeans, stands on a dimly lit stage with a casual posture, holding what appears to be a remote or clicker. There are two birch tree stumps of different heights positioned to the speaker's left. To the right, there is a white bench with vertical slats, suggesting a simple, nature-inspired set design. The stage floor is dark, contrasting with the lighter backdrop. The colors are muted with blacks, whites, and greys dominating the scene. The degraded version removes the middle segment, leaving only 2 of the original 3 segments (Seg 2 is missing from the degraded video).

Spatial Relationship

Original

Degraded

Degradation Operation: Identify clips mentioning left/right spatial relations, then horizontally flip all frames in those clips.
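The two steps can be sketched as below. The keyword regex is only a simple stand-in for clip selection (the page does not say how mentions of left/right are detected), and frames are modelled as nested lists of pixels to keep the example dependency-free.

```python
import re
from typing import List, Sequence, TypeVar

Pixel = TypeVar("Pixel")

# Whole-word match, so "brightly" or "leftover" do not trigger a flip.
_LEFT_RIGHT = re.compile(r"\b(left|right)\b", re.IGNORECASE)


def mentions_left_right(caption: str) -> bool:
    """Heuristic check for a left/right spatial relation in a clip caption."""
    return bool(_LEFT_RIGHT.search(caption))


def hflip(frame: Sequence[Sequence[Pixel]]) -> List[List[Pixel]]:
    """Mirror a frame horizontally by reversing each row of pixels."""
    return [list(reversed(row)) for row in frame]
```

With NumPy arrays of shape (height, width, channels), the same flip is `frame[:, ::-1]`.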

Video Description

The video shows a detailed view of a toy, specifically a Kamen Rider Gaim's Sengoku Driver with an attached Genesis Core on the left side. The driver is predominantly black with metallic and gold accents, displaying an LED screen reading '5.13' and a handle on the right. In the background, there is a nondescript beige wall. The light source appears to be coming from the front, illuminating the toy evenly without harsh shadows. In the last frame, a hand is inserting a pink and blue lockseed into the right side of the driver.

Temporal Flow

Original

Degraded

Degradation Operation: Move 5 consecutive clips to random positions, thereby breaking the temporal order of events.
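One reading of this operation is to cut out a block of consecutive clips and reinsert it elsewhere. The sketch below follows that reading; the function name, the seeding, and the rule that the new position must differ from the old one are assumptions.

```python
import random
from typing import List, Optional, Sequence, TypeVar

Clip = TypeVar("Clip")


def relocate_block(
    clips: Sequence[Clip],
    block_len: int = 5,
    seed: Optional[int] = None,
) -> List[Clip]:
    """Move one run of `block_len` consecutive clips to a different position."""
    n = len(clips)
    if block_len < 1 or n <= block_len:
        raise ValueError("need a positive block length and more clips than that length")
    rng = random.Random(seed)
    start = rng.randrange(n - block_len + 1)
    block = list(clips[start:start + block_len])
    rest = list(clips[:start]) + list(clips[start + block_len:])
    # Reinserting at `start` would reproduce the original order, so exclude it.
    choices = [i for i in range(len(rest) + 1) if i != start]
    insert_at = rng.choice(choices)
    return rest[:insert_at] + block + rest[insert_at:]
```

The clip set is unchanged, so only the order of events distinguishes the degraded video from the original.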

Video Description

A close-up shot focuses on the torso of an individual wearing a black utility vest with multiple pockets and attachments. The person appears to be holding an electronic device, possibly a tablet, which has a logo resembling a flame, and is housed in a black protective case with clips. A pen is also visible, secured on the vest. The vest has a velcro patch area but no discernible patches attached. The individual's other hand is partially visible, holding the tablet or adjusting its position. The vest and tablet are predominantly black, providing a stark contrast to the white background.

Experimental Results

Accuracy (%) of each baseline system on SLVMEval, reported as accuracy ± 95% CI. Chance level is 50%.

Human evaluators achieve 84.7%–96.8% accuracy across all 10 aspects, while in 9 of the 10 aspects, the accuracy of automatic evaluation systems falls short of human judgment.

System	Video Quality					Video-Text Consistency
System	Aesthetics	Technical Quality	Appearance Style	Background Consistency	Object Integrity	Color	Dynamics Degree	Comprehensiveness	Spatial Relationship	Temporal Flow
Video-based
GPT-5	90.1 ±2.5	85.8 ±4.2	88.9 ±2.5	98.9 ±0.8	72.0 ±6.2	84.3 ±3.5	35.3 ±3.6	51.3 ±4.5	59.7 ±4.4	50.3 ±4.1
GPT-5-mini	84.0 ±3.0	48.1 ±6.1	78.0 ±3.2	95.2 ±1.6	66.5 ±6.5	69.4 ±4.5	31.5 ±3.5	45.7 ±4.5	51.1 ±4.5	43.7 ±4.1
Qwen3	55.7 ±4.1	51.9 ±6.1	55.3 ±3.9	49.7 ±3.7	38.5 ±6.7	48.4 ±4.9	50.0 ±3.8	51.7 ±4.5	51.7 ±4.5	50.2 ±4.1
Text-based
GPT-5	74.8 ±3.6	46.2 ±6.1	81.1 ±3.1	83.8 ±2.7	68.0 ±6.5	68.9 ±4.5	43.1 ±3.8	50.6 ±4.5	47.0 ±4.5	43.5 ±4.1
GPT-5-mini	75.0 ±3.6	53.8 ±6.1	79.6 ±3.2	81.1 ±2.9	65.5 ±6.6	71.8 ±4.4	43.8 ±3.8	50.6 ±4.5	51.1 ±4.5	41.2 ±4.0
Qwen3	51.6 ±4.1	50.0 ±6.1	72.4 ±3.5	73.0 ±3.3	51.0 ±6.9	61.0 ±4.7	48.6 ±3.9	52.7 ±4.5	51.7 ±4.5	52.9 ±4.5
CLIPScore	56.4 ±5.8	72.3 ±7.7	53.2 ±5.5	68.6 ±4.8	76.0 ±8.4	66.2 ±6.5	51.7 ±5.4	57.4 ±6.3	55.1 ±6.3	50.5 ±5.8
VideoScore	52.5 ±5.8	33.8 ±8.3	65.7 ±5.3	71.2 ±4.7	66.0 ±9.3	33.8 ±6.5	52.7 ±4.9	34.5 ±6.1	49.6 ±6.4	46.3 ±5.8
🌟 Human	96.5 ±2.1	91.8 ±4.7	95.2 ±2.4	95.0 ±2.3	86.6 ±6.7	96.8 ±2.4	95.9 ±2.1	84.7 ±4.6	88.2 ±4.1	86.6 ±4.0
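The ± values in the table are 95% confidence intervals. The page does not state how they were computed; the sketch below uses the normal-approximation (Wald) interval for a binomial proportion as one common choice, and the function name is hypothetical.

```python
import math
from typing import Tuple


def accuracy_with_ci(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Return (accuracy %, half-width %) using the normal approximation.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return 100 * p, 100 * half_width
```

Under this approximation the half-width shrinks with the square root of the number of pairs, so aspects with fewer retained pairs show wider intervals.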

Key Findings

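9/10 aspects: Automatic systems fall short. All current automatic evaluation systems, including GPT-5, lag behind human performance on 9 out of 10 evaluation aspects, despite these tasks being easy for humans. The exception is Background Consistency, where video-based GPT-5 (98.9%) and GPT-5-mini (95.2%) exceed the human accuracy of 95.0%.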
Duration sensitivity. For most evaluation aspects, automatic system accuracy decreases as video duration increases (negative Spearman ρ_S), revealing critical limitations for long-video evaluation.
Strong human baseline. Human evaluators achieve 84.7%–96.8% accuracy across all 10 aspects, establishing a strong human-level target for future T2V evaluation systems to reach.
Reliable synthetic degradation. High Pearson correlation (ρ_P > 0.94) between filtered and unfiltered results shows our degradation pipeline produces reliable pairs without costly manual filtering.
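The two correlation statistics named above can be computed as in the following dependency-free sketch. The function names are illustrative, and what the inputs are (for example, duration bins paired with accuracies for ρ_S, or filtered versus unfiltered accuracies for ρ_P) is an assumption about how the analysis was set up.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

A negative `spearman(durations, accuracies)` indicates that accuracy tends to fall as videos get longer, regardless of whether the relationship is linear.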

BibTeX

@inproceedings{matsuda2026slvmeval,
  title     = {SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation},
  author    = {Ryosuke Matsuda and Keito Kudo and Haruto Yoshida and Nobuyuki Shimizu and Jun Suzuki},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026},
}

Acknowledgement

This website is adapted from VDocRAG and Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

SLVMEval

Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation

CVPR 2026
