University of Illinois at Urbana-Champaign
Abstract
Modern visual effects (VFX) software has made it possible for skilled artists
to create imagery of virtually anything. However, the creation process remains
laborious, complex, and largely inaccessible to everyday users. In this work, we
present AutoVFX, a framework that automatically creates realistic and dynamic
VFX videos from a single video and natural language instructions. By carefully
integrating neural scene modeling, LLM-based code generation, and physical simulation,
AutoVFX is able to provide physically-grounded, photorealistic editing effects
that can be controlled directly using natural language instructions. We conduct
extensive experiments to validate AutoVFX's efficacy across a diverse spectrum
of videos and instructions. Quantitative and qualitative results suggest that
AutoVFX outperforms all competing methods by a large margin in generative quality,
instruction alignment, editing versatility, and physical plausibility.
Dynamic visual effects on videos
Gardenverse
Throw a basketball with fire towards the vase with flowers and break the vase upon collision.
Indoor scenes
Insert an animated dragon moving above and around the floor.
Outdoor & autonomous driving scenes
Insert a physics-enabled Benz G 20 meters in front of us with random 2D rotation. Add a Ferrari moving forward.
Comparison to related works
Baselines
ChatSim
We compare our method (right) with ChatSim (middle) for autonomous driving scene simulation.
The original input image is shown on the left.
Pika1.5
We compare our method (right) with the visual effect, Pikaffect, created by Pika1.5 (middle). The original input image for Pika1.5 is shown on the left.
Our method generates more localized and physically accurate visual effects.
AutoVFX framework
Our instruction-guided video editing framework consists of three main modules:
(1) 3D Scene Modeling (left), which integrates 3D reconstruction and scene understanding models;
(2) Program Generation (middle), where LLMs generate editing programs based on user instructions; and
(3) VFX Modules (right), which include predefined functions specialized for various editing tasks.
These components are integrated with a physically-based simulation and rendering engine (e.g., Blender) to generate the final video.
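To make the interface between program generation and the VFX modules concrete, below is a minimal, hypothetical Python sketch of the execution loop: an LLM-generated editing program calls a small library of predefined VFX functions over a reconstructed scene, and the result is handed to the renderer. The Scene structure, the function names (insert_object, add_fire, enable_physics, render_video), and the hard-coded generate_program are illustrative assumptions, not the actual AutoVFX API.

from dataclasses import dataclass, field

@dataclass
class Scene:
    # Reconstructed 3D scene: geometry, appearance, and semantic object labels.
    objects: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

# Hypothetical VFX module library exposed to the LLM through its documentation.
def insert_object(scene, asset, position):
    # Place a 3D asset at a world-space position in the reconstructed scene.
    scene.objects[asset] = {"position": list(position)}
    scene.events.append(f"insert {asset} at {tuple(position)}")

def add_fire(scene, asset):
    # Attach a fire effect to an existing asset.
    scene.events.append(f"add fire to {asset}")

def enable_physics(scene, asset, breakable=False):
    # Register the asset with the rigid-body (and optional fracture) simulation.
    scene.events.append(f"enable physics on {asset} (breakable={breakable})")

def render_video(scene, out_path):
    # In the real pipeline this step hands the edited scene to a physically
    # based simulator/renderer such as Blender; here we just print the plan.
    print(f"Rendering {len(scene.events)} edits to {out_path}:")
    for event in scene.events:
        print("  -", event)

VFX_API = {fn.__name__: fn for fn in (insert_object, add_fire, enable_physics, render_video)}

def generate_program(instruction):
    # Placeholder for the LLM call: given the instruction and the VFX API docs,
    # the LLM would return an editing program as Python source. Hard-coded here.
    return (
        "insert_object(scene, 'basketball', (0.0, 1.5, 2.0))\n"
        "add_fire(scene, 'basketball')\n"
        "enable_physics(scene, 'vase', breakable=True)\n"
        "render_video(scene, 'edit.mp4')\n"
    )

if __name__ == "__main__":
    scene = Scene()
    program = generate_program(
        "Throw a basketball with fire towards the vase and break it upon collision.")
    # Execute the generated program with only the scene and the VFX API in scope.
    exec(program, {"scene": scene, **VFX_API})

Exposing the editing vocabulary as plain functions is what lets the language model compose multiple effects (insertion, fire, fracture, simulation) from a single instruction.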
References
- Haque, Ayaan, Matthew Tancik, Alexei A. Efros, Aleksander Holynski, and Angjoo Kanazawa. Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions. In ICCV 2023. [code]
- Yang, Shuai, Yifan Zhou, Ziwei Liu, and Chen Change Loy. FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation. In CVPR 2024. [code]
- Chen, Minghao, Iro Laina, and Andrea Vedaldi. DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing. In ECCV 2024. [code]
Citation
If you find our project useful, please consider citing:
@article{hsu2024autovfx,
title={AutoVFX: Physically Realistic Video Editing from Natural Language Instructions},
author={Hsu, Hao-Yu and Lin, Zhi-Hao and Zhai, Albert and Xia, Hongchi and Wang, Shenlong},
journal={arXiv preprint arXiv:2411.02394},
year={2024}
}
Acknowledgements
This project is supported by the Intel AI SRS gift, a Meta research grant, the IBM IIDAI Grant, and NSF Awards #2331878, #2340254, #2312102, #2414227, and #2404385. Hao-Yu Hsu is supported by a Siebel Scholarship. We greatly appreciate NCSA for providing computing resources. We thank Derek Hoiem, Sarita Adve, Benjamin Ummenhofer, Kai Yuan, Michael Paulitsch, Katelyn Gao, and Quentin Leboutet for helpful discussions. The website template was borrowed from ClimateNeRF, RefNeRF, and Nerfies.