
Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory

Dohun Lee, Chun-Hao Paul Huang, Xuelin Chen, Jong Chul Ye, Duygu Ceylan, Hyeonho Jeong
Published: January 22, 2026
Authors: 6
Word count: 10,132
Code: included

Memory-V2V enables consistent multi-turn video editing.

Abstract

Recent foundational video-to-video diffusion models have achieved impressive results in editing user-provided videos by modifying appearance, motion, or camera movement. However, real-world video editing is often an iterative process, where users refine results across multiple rounds of interaction. In this multi-turn setting, current video editors struggle to maintain cross-consistency across sequential edits. In this work, we tackle, for the first time, the problem of cross-consistency in multi-turn video editing and introduce Memory-V2V, a simple yet effective framework that augments existing video-to-video models with explicit memory. Given an external cache of previously edited videos, Memory-V2V employs accurate retrieval and dynamic tokenization strategies to condition the current editing step on prior results. To further mitigate redundancy and computational overhead, we propose a learnable token compressor within the DiT backbone that compresses redundant conditioning tokens while preserving essential visual cues, achieving an overall speedup of 30%. We validate Memory-V2V on challenging tasks including video novel view synthesis and text-conditioned long video editing. Extensive experiments show that Memory-V2V produces videos that are significantly more cross-consistent with minimal computational overhead, while maintaining or even improving task-specific performance over state-of-the-art baselines. Project page: https://dohunlee1.github.io/MemoryV2V
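The abstract describes two mechanisms: an external cache of past edits that is queried by retrieval, and a token compressor that prunes redundant conditioning tokens before they reach the DiT backbone. The sketch below is a hypothetical, non-learned illustration of both ideas (the class and function names are invented for this example; the paper's actual retrieval and learnable compressor will differ):

```python
import numpy as np

class MemoryV2VCache:
    """Hypothetical sketch of an external edit memory: stores token
    embeddings of previously edited videos and retrieves the most
    similar one to condition the current editing step."""
    def __init__(self):
        self.keys = []    # one mean-pooled summary vector per stored edit
        self.values = []  # full token sequences, shape (n_tokens, dim)

    def add(self, tokens):
        tokens = np.asarray(tokens, dtype=np.float64)
        self.keys.append(tokens.mean(axis=0))
        self.values.append(tokens)

    def retrieve(self, query_tokens):
        # Cosine similarity between the mean-pooled query and each key.
        q = np.asarray(query_tokens, dtype=np.float64).mean(axis=0)
        q = q / (np.linalg.norm(q) + 1e-8)
        sims = [float(q @ (k / (np.linalg.norm(k) + 1e-8)))
                for k in self.keys]
        return self.values[int(np.argmax(sims))]

def compress_tokens(tokens, keep_ratio=0.5):
    """Stand-in for the learnable token compressor: keeps the
    highest-norm tokens. A trained compressor would instead learn
    which conditioning tokens carry essential visual cues."""
    tokens = np.asarray(tokens, dtype=np.float64)
    n_keep = max(1, int(len(tokens) * keep_ratio))
    order = np.argsort(-np.linalg.norm(tokens, axis=1))[:n_keep]
    return tokens[np.sort(order)]  # preserve original token order
```

Dropping half the conditioning tokens before cross-attention is where a speedup like the reported 30% would come from, since attention cost scales with the conditioning sequence length.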

Key Takeaways

  1. Memory-V2V enhances iterative video editing consistency.

  2. Utilizes efficient memory mechanisms for scalability.

  3. Outperforms baselines in novel view synthesis and text-guided editing.

Limitations

  • Requires external memory storage for past edits.

  • Computational cost may increase with more edits.

Keywords

video-to-video diffusion models, cross-consistency, multi-turn video editing, memory augmentation, retrieval, dynamic tokenization, token compressor, DiT backbone, video novel view synthesis, text-conditioned long video editing
