University of Illinois at Urbana-Champaign
Abstract
Modern visual effects (VFX) software has made it possible for skilled artists
to create imagery of virtually anything. However, the creation process remains
laborious, complex, and largely inaccessible to everyday users. In this work, we
present AutoVFX, a framework that automatically creates realistic and dynamic
VFX videos from a single video and natural language instructions. By carefully
integrating neural scene modeling, LLM-based code generation, and physical simulation,
AutoVFX is able to provide physically-grounded, photorealistic editing effects
that can be controlled directly using natural language instructions. We conduct
extensive experiments to validate AutoVFX's efficacy across a diverse spectrum
of videos and instructions. Quantitative and qualitative results suggest that
AutoVFX outperforms all competing methods by a large margin in generative quality,
instruction alignment, editing versatility, and physical plausibility.
Dynamic visual effects on videos
Gardenverse
Throw a basketball with fire towards the vase with flowers and break the vase upon collision.
Indoor scenes
Insert an animated dragon moving above and around the floor.
Outdoor & autonomous driving scenes
Insert a physics-enabled Benz G 20 meters in front of us with random 2D rotation. Add a Ferrari moving forward.
Comparison to related works
Baselines
ChatSim
We compare our method (right) with ChatSim (middle) for autonomous driving scene simulation.
The original input image is shown on the left.
Pika1.5
We compare our method (right) with the visual effect, Pikaffect, created by Pika1.5 (middle). The original input image for Pika1.5 is shown on the left.
Our method generates more localized and physically accurate visual effects.
AutoVFX framework
Our instruction-guided video editing framework consists of three main modules:
(1) 3D Scene Modeling (left), which integrates 3D reconstruction and scene understanding models;
(2) Program Generation (middle), where LLMs generate editing programs based on user instructions; and
(3) VFX Modules (right), which include predefined functions specialized for various editing tasks.
These components are integrated with a physically-based simulation and rendering engine (e.g., Blender) to generate the final video.
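To make the interface between program generation and the VFX modules concrete, below is a minimal, hypothetical Python sketch of the execution loop: an LLM-generated editing program calls a small library of predefined VFX functions over a reconstructed scene, and the result is handed to the renderer. The Scene structure, the function names (insert_object, add_fire, enable_physics, render_video), and the hard-coded generate_program are illustrative assumptions, not the actual AutoVFX API.

from dataclasses import dataclass, field

@dataclass
class Scene:
    # Reconstructed 3D scene: geometry, appearance, and semantic object labels.
    objects: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

# Hypothetical VFX module library exposed to the LLM through its documentation.
def insert_object(scene, asset, position):
    # Place a 3D asset at a world-space position in the reconstructed scene.
    scene.objects[asset] = {"position": list(position)}
    scene.events.append(f"insert {asset} at {tuple(position)}")

def add_fire(scene, asset):
    # Attach a fire effect to an existing asset.
    scene.events.append(f"add fire to {asset}")

def enable_physics(scene, asset, breakable=False):
    # Register the asset with the rigid-body (and optional fracture) simulation.
    scene.events.append(f"enable physics on {asset} (breakable={breakable})")

def render_video(scene, out_path):
    # In the real pipeline this step hands the edited scene to a physically
    # based simulator/renderer such as Blender; here we just print the plan.
    print(f"Rendering {len(scene.events)} edits to {out_path}:")
    for event in scene.events:
        print("  -", event)

VFX_API = {fn.__name__: fn for fn in (insert_object, add_fire, enable_physics, render_video)}

def generate_program(instruction):
    # Placeholder for the LLM call: given the instruction and the VFX API docs,
    # the LLM would return an editing program as Python source. Hard-coded here.
    return (
        "insert_object(scene, 'basketball', (0.0, 1.5, 2.0))\n"
        "add_fire(scene, 'basketball')\n"
        "enable_physics(scene, 'vase', breakable=True)\n"
        "render_video(scene, 'edit.mp4')\n"
    )

if __name__ == "__main__":
    scene = Scene()
    program = generate_program(
        "Throw a basketball with fire towards the vase and break it upon collision.")
    # Execute the generated program with only the scene and the VFX API in scope.
    exec(program, {"scene": scene, **VFX_API})

Exposing the editing vocabulary as plain functions is what lets the language model compose multiple effects (insertion, fire, fracture, simulation) from a single instruction.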
References
- Haque, Ayaan, Matthew Tancik, Alexei A. Efros, Aleksander Holynski, and Angjoo Kanazawa. Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions. In ICCV 2023. [code]
- Yang, Shuai, Yifan Zhou, Ziwei Liu, and Chen Change Loy. FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation. In CVPR 2024. [code]
- Chen, Minghao, Iro Laina, and Andrea Vedaldi. DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing. In ECCV 2024. [code]
Citation
If you find our project useful, please consider citing:
@article{hsu2024autovfx,
title={AutoVFX: Physically Realistic Video Editing from Natural Language Instructions},
author={Hsu, Hao-Yu and Lin, Zhi-Hao and Zhai, Albert and Xia, Hongchi and Wang, Shenlong},
journal={arXiv preprint arXiv:2411.02394},
year={2024}
}
Acknowledgements
This project is supported by the Intel AI SRS gift, a Meta research grant, the IBM IIDAI Grant, and NSF Awards #2331878, #2340254, #2312102, #2414227, and #2404385. Hao-Yu Hsu is supported by a Siebel Scholarship. We greatly appreciate NCSA for providing computing resources. We thank Derek Hoiem, Sarita Adve, Benjamin Ummenhofer, Kai Yuan, Michael Paulitsch, Katelyn Gao, and Quentin Leboutet for helpful discussions. The website template was borrowed from ClimateNeRF, RefNeRF, and Nerfies.