ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos

1 MBZUAI   2 Monash University   3 Shanghai AI Lab   5 HKUST   6 Shanghai Jiao Tong University

Abstract

Video diffusion models (VDMs) facilitate the generation of high-quality videos, with current research predominantly concentrated on scaling at training time through improvements in data quality, computational resources, and model complexity. However, inference-time scaling has received far less attention, with most approaches restricting models to a single generation attempt. Recent studies have uncovered the existence of “golden noise” that can enhance video quality during generation. Building on this, we find that guiding the inference-time search of VDMs toward better noise candidates not only evaluates the quality of the frames generated at the current step but also preserves high-level object features by referencing an anchor frame from previously generated chunks, thereby delivering long-term value. Our analysis reveals that diffusion models can flexibly adjust their computation by varying the number of denoising steps, and that even one-step denoising, when guided by a reward signal, yields significant long-term benefits. Based on this observation, we propose ScalingNoise, a plug-and-play inference-time search strategy that identifies golden initial noise for the diffusion sampling process to improve global content consistency and visual diversity. Specifically, we perform one-step denoising to convert initial noise into a clip and then evaluate its long-term value with a reward model anchored on previously generated content. Moreover, to preserve diversity, we sample candidates from a tilted noise distribution that up-weights promising noises. In this way, ScalingNoise significantly reduces noise-induced errors, ensuring more coherent and spatiotemporally consistent video generation. Extensive experiments on benchmark datasets demonstrate that ScalingNoise effectively improves both content fidelity and subject consistency for resource-constrained long video generation.
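To make the tilted candidate distribution concrete, the Python sketch below shows one possible way to realize it: mixing fresh Gaussian draws (for diversity) with perturbations of previously high-reward noises (to up-weight promising regions). The function name, mixing ratio, and perturbation scale are illustrative assumptions, not values or code from the paper.

import torch

def sample_tilted_candidates(promising_noises, noise_shape,
                             num_candidates=8, fresh_ratio=0.5, perturb_scale=0.3):
    """Draw candidate initial noises from a distribution tilted toward
    previously promising noises, while keeping fresh samples for diversity."""
    if not promising_noises:
        # No history yet: fall back to purely fresh Gaussian noise.
        return [torch.randn(noise_shape) for _ in range(num_candidates)]
    num_fresh = int(num_candidates * fresh_ratio)
    # Fresh standard-Gaussian draws preserve exploration and visual diversity.
    candidates = [torch.randn(noise_shape) for _ in range(num_fresh)]
    for i in range(num_candidates - num_fresh):
        base = promising_noises[i % len(promising_noises)]
        # Perturb a promising noise and rescale the variance so the result
        # still looks approximately like a standard Gaussian to the sampler.
        candidates.append((1.0 - perturb_scale) ** 0.5 * base
                          + perturb_scale ** 0.5 * torch.randn(noise_shape))
    return candidates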

Methodology

Overview of the proposed method, ScalingNoise, which improves long video generation through inference-time search rather than training-time scaling. Specifically, our strategy conducts a tailored step-by-step beam search for suitable initial noise, guided by a reward model that incorporates an anchor frame to provide a long-term signal. At each step, we perform one-step denoising on the candidate noises to obtain a clearer clip for evaluation; the reward model then predicts the long-term value of each candidate, helping avoid noise that could introduce future inconsistencies.
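As a rough illustration of this search step, the sketch below scores candidate initial noises with one-step denoising and a reward model conditioned on an anchor frame, then keeps only the highest-value candidates. The one_step_denoise and reward_model callables are hypothetical placeholders standing in for the video diffusion model and the consistency reward described above, not the released implementation.

def select_initial_noise(one_step_denoise, reward_model, anchor_frame,
                         candidate_noises, beam_width=2):
    """Rank candidate initial noises for the next chunk and keep the best few.

    one_step_denoise(noise) -> coarse clip estimate from a single denoising step.
    reward_model(clip, anchor_frame) -> scalar long-term value, where the
    anchor frame comes from previously generated chunks.
    """
    scored = []
    for noise in candidate_noises:
        # One-step denoising is cheap, yet yields a clear enough clip for the
        # reward model to judge consistency with the anchor frame.
        coarse_clip = one_step_denoise(noise)
        value = reward_model(coarse_clip, anchor_frame)
        scored.append((float(value), noise))
    # Keep the top-`beam_width` noises; only these are passed to the full
    # multi-step sampler, so future inconsistencies are pruned early.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [noise for _, noise in scored[:beam_width]]

Combined with a candidate sampler such as the tilted-distribution sketch above, one search step would amount to: sample candidates, score them, keep the best, generate the chunk, and reuse its frames as anchors for the next step.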

FIFO-Diffusion + ScalingNoise

BibTeX


@misc{yang2025scalingnoisescalinginferencetimesearch,
  title={ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos}, 
  author={Haolin Yang and Feilong Tang and Ming Hu and Yulong Li and Yexin Liu and Zelin Peng and Junjun He and Zongyuan Ge and Imran Razzak},
  year={2025},
  eprint={2503.16400},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2503.16400}, 
}