Learning Plug-and-play Memory for Guiding Video Diffusion Models - Ziming Xu

Abstract

Diffusion Transformer (DiT) based video generation models have recently achieved impressive visual quality and temporal coherence, but they still frequently violate basic physical laws and commonsense dynamics, revealing a lack of explicit world knowledge. In this work, we explore to equip them with a plug-and-play memory that injects useful world knowledge. Motivated by in-context memory in Transformer-based LLMs, we conduct empirical studies to show that DiT can be steered via interventions on its hidden states, and simple low-pass and high-pass filters in the embedding space naturally disentangle low-level appearance and high-level physical/semantic cues, enabling targeted guidance. Building on these observations, we propose a learnable memory encoder DiT-Mem, composed of stacked 3D CNNs, low-/high-pass filters, and self-attention layers. The encoder maps reference videos into a compact set of memory tokens, which are concatenated as the memory within the DiT self-attention layers. During training, we keep the diffusion backbone frozen, and only optimize the memory encoder. It yields a rather efficient training process on few training parameters (e.g. 150M) and 10k data samples, and enables plug-and-play usage at inference time. Extensive experiments on state-of-the-art models demonstrate the effectiveness of our method in improving physical rule following and video fidelity.

Video Gallery

"A person is sword fighting"

"A polar bear is playing guitar"

"A person is pouring water into a teacup"

"golden fish swimming in the ocean"

"pouring milk into a kettle"

"A piece of copper is ignited, emitting a vivid and unique flame as it burns steadily"

"A panda playing on a swing set"

"A person is running on treadmill"

"A small burning match was thrown into a pile of dry leaves"

"A squeeze of honey is slowly released in the space station, spreading the liquid into the surrounding area"

"A drone is hovering above a quiet and glassy swimming pool, with the reflection of the drone appearing on the water surface"

"A metal fork is gently placed into a glass of crystal-clear water, displaying the interesting visual distortions and reflections as the fork meets the liquid"

Experimental Results

We evaluate our methods on three comprehensive benchmarks: (1) PhyGenBench targets physical commonsense in text-to-video generation with 160 prompts spanning 27 physical laws; (2) VBench provides hierarchical evaluation decomposing "quality" into well-defined dimensions; (3) OpenVidHD Test Set measures low-level video generation metrics (FVD, CLIP, Action/Motion) on 100 clips.

Comparison with baselines on PhyGenBench. Our models (Wan2.1+Ours and Wan2.2+Ours) achieve the best performance within their respective backbone groups across almost all physics domains.

Comparison on VBench, where metrics are grouped into low-level (appearance/texture) and high-level (semantic/dynamic) categories. The Wan2.2+Ours variant achieves the strongest low-level and high-level average scores among all compared models.

Evaluation on the OpenVidHD test set using standard low-level video generation metrics. Wan2.1+Ours maintains strong text–video alignment, action accuracy, and motion quality, while achieving the lowest FVD.

In-Depth Analysis

We further analyze how DiT-Mem behaves under architectural ablations, memory size changes, and frequency branch selections. Use the slider below to navigate across the studies.

Visual ablation study of DiT-Mem components

Ablation study on DiT-Mem-1.3B evaluating the contribution of each component of the proposed memory encoder. We report variants obtained by progressively adding the 3D convolution layers (3D), high-pass filter (HPF), low-pass filter (LPF), and attention modules—either separate attention (SPA) or shared attention (SA).

Effect of memory size on PhyGenBench performance using DiT-Mem-1.3B. Larger memory banks provide richer reference knowledge and yield consistently higher scores, while the model retains strong robustness even with significantly reduced memory capacity.

Frequency ablation on VBench with metrics grouped into low- and high-level categories. LPF-only performs better on appearance-related metrics, while HPF-only excels on semantic/dynamic ones, validating the complementary roles of the two frequency filters.

BibTeX

@misc{2511.19229,
  Author = {Selena Song and Ziming Xu and Zijun Zhang and Kun Zhou and Jiaxian Guo and Lianhui Qin and Biwei Huang},
  Title = {Learning Plug-and-play Memory for Guiding Video Diffusion Models},
  Year = {2025},
  Eprint = {arXiv:2511.19229},
}