University of Illinois at Urbana-Champaign

Abstract

Modern visual effects (VFX) software has made it possible for skilled artists to create imagery of virtually anything. However, the creation process remains laborious, complex, and largely inaccessible to everyday users. In this work, we present AutoVFX, a framework that automatically creates realistic and dynamic VFX videos from a single video and natural language instructions. By carefully integrating neural scene modeling, LLM-based code generation, and physical simulation, AutoVFX is able to provide physically-grounded, photorealistic editing effects that can be controlled directly using natural language instructions. We conduct extensive experiments to validate AutoVFX's efficacy across a diverse spectrum of videos and instructions. Quantitative and qualitative results suggest that AutoVFX outperforms all competing methods by a large margin in generative quality, instruction alignment, editing versatility, and physical plausibility.

Dynamic visual effects on videos

Gardenverse

Throw a basketball with fire towards vase with flowers and break the vase with collision.

Indoor scenes

Insert an animated dragon moving above and around the floor.

Outdoor & autonomous driving scenes

Insert a physics-enabled Benz G 20 meters in front of us with random 2D rotation. Add a Ferrari moving forward.

Comparison to related works

Baselines

We compare our method with three baseline methods: Instruct-NeRF2NeRF, DGE, and FRESCO.

ChatSim

We compare our method (right) with ChatSim (middle) for autonomous driving scene simulation. The original input image is shown on the left.

Pika 1.5

We compare our method (right) with the Pikaffect visual effect generated by Pika 1.5 (middle). The original input image for Pika 1.5 is shown on the left. Our method produces more localized and physically accurate visual effects.

AutoVFX framework

Our instruction-guided video editing framework consists of three main modules: (1) 3D Scene Modeling (left), which integrates 3D reconstruction and scene understanding models; (2) Program Generation (middle), where LLMs generate editing programs based on user instructions; and (3) VFX Modules (right), which include predefined functions specialized for various editing tasks. These components are integrated with a physically-based simulation and rendering engine (e.g., Blender) to generate the final video.
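
As a rough illustration of the Program Generation step, the sketch below shows the kind of editing program an LLM might emit for the basketball example above, expressed as calls to predefined VFX module functions. All names here (Scene, insert_object, set_on_fire, enable_fracture, throw_object) are hypothetical stand-ins for illustration rather than the actual AutoVFX API; in the real pipeline, the resulting edits would be handed to the physically-based simulation and rendering engine (e.g., Blender) to produce the final video.

# Minimal sketch of an LLM-generated editing program (hypothetical API).
from dataclasses import dataclass, field

@dataclass
class Scene:
    """Stand-in for the reconstructed 3D scene (geometry + semantics)."""
    edits: list = field(default_factory=list)

# Hypothetical VFX module functions exposed to the LLM.
def insert_object(scene, asset, position):
    scene.edits.append(("insert", asset, position))

def set_on_fire(scene, asset):
    scene.edits.append(("fire", asset))

def enable_fracture(scene, asset):
    scene.edits.append(("fracture", asset))

def throw_object(scene, asset, target, speed=5.0):
    scene.edits.append(("throw", asset, target, speed))

# Program the LLM might generate for: "Throw a basketball with fire towards
# vase with flowers and break the vase with collision."
def edit(scene):
    insert_object(scene, "basketball", position=(0.0, 1.0, 2.0))
    set_on_fire(scene, "basketball")
    enable_fracture(scene, "vase_with_flowers")
    throw_object(scene, "basketball", target="vase_with_flowers")

if __name__ == "__main__":
    scene = Scene()
    edit(scene)
    # In the full pipeline, these edits would drive the physics simulation
    # and rendering in Blender; here we simply print them.
    for e in scene.edits:
        print(e)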

References

  1. Haque, Ayaan, Matthew Tancik, Alexei A. Efros, Aleksander Holynski, and Angjoo Kanazawa. Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions. In ICCV 2023. [code]
  2. Yang, Shuai, Yifan Zhou, Ziwei Liu, and Chen Change Loy. FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation. In CVPR 2024. [code]
  3. Chen, Minghao, Iro Laina, and Andrea Vedaldi. DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing. In ECCV 2024. [code]

Citation

If you find our project useful, please consider citing:
@article{hsu2024autovfx,
    title={AutoVFX: Physically Realistic Video Editing from Natural Language Instructions},
    author={Hsu, Hao-Yu and Lin, Zhi-Hao and Zhai, Albert and Xia, Hongchi and Wang, Shenlong},
    journal={arXiv preprint arXiv:2411.02394},
    year={2024}
}

Acknowledgements

This project is supported by the Intel AI SRS gift, a Meta research grant, the IBM IIDAI Grant, and NSF Awards #2331878, #2340254, #2312102, #2414227, and #2404385. Hao-Yu Hsu is supported by a Siebel Scholarship. We greatly appreciate the NCSA for providing computing resources. We thank Derek Hoiem, Sarita Adve, Benjamin Ummenhofer, Kai Yuan, Michael Paulitsch, Katelyn Gao, and Quentin Leboutet for helpful discussions.
The website template was borrowed from ClimateNeRF, RefNeRF, and Nerfies.